Preface


This book originated from a set of lecture notes for a one-quarter graduate- 
level course taught at the University of Washington. The purpose of the course 
is to familiarize the students with the basic concepts of Bayesian theory and 
to quickly get them performing their own data analyses using Bayesian com- 
putational tools. The audience for this course includes non-statistics graduate 
students who did well in their department’s graduate-level introductory statis- 
tics courses and who also have an interest in statistics. Additionally, first- and 
second-year statistics graduate students have found this course to be a useful 
introduction to statistical modeling. Like the course, this book is intended to 
be a self-contained and compact introduction to the main concepts of Bayesian 
theory and practice. By the end of the text, readers should have the ability to 
understand and implement the basic tools of Bayesian statistical methods for 
their own data analysis purposes. The text is not intended as a comprehen- 
sive handbook for advanced statistical researchers, although it is hoped that 
this latter category of readers could use this book as a quick introduction to 
Bayesian methods and as a preparation for more comprehensive and detailed 
studies. 


Computing 


Monte Carlo summaries of posterior distributions play an important role in 
the way data analyses are presented in this text. My experience has been 
that once a student understands the basic idea of posterior sampling, their 
data analyses quickly become more creative and meaningful, using relevant 
posterior predictive distributions and interesting functions of parameters. The 
open-source R statistical computing environment provides sufficient function- 
ality to make Monte Carlo estimation very easy for a large number of statis- 
tical models, and example R-code is provided throughout the text. Much of 
the example code can be run “as is” in R, and essentially all of it can be run 
after downloading the relevant datasets from the companion website for this 
book. 
Introduction and examples


1.1 Introduction 


We often use probabilities informally to express our information and beliefs 
about unknown quantities. However, the use of probabilities to express infor- 
mation can be made formal: In a precise mathematical sense, it can be shown 
that probabilities can numerically represent a set of rational beliefs, that there 
is a relationship between probability and information, and that Bayes’ rule 
provides a rational method for updating beliefs in light of new information. 
The process of inductive learning via Bayes’ rule is referred to as Bayesian 
inference. 

More generally, Bayesian methods are data analysis tools that are derived 
from the principles of Bayesian inference. In addition to their formal interpre- 
tation as a means of induction, Bayesian methods provide: 


• parameter estimates with good statistical properties;

• parsimonious descriptions of observed data;

• predictions for missing data and forecasts of future data;

• a computational framework for model estimation, selection and validation.


Thus the uses of Bayesian methods go beyond the formal task of induction 
for which the methods are derived. Throughout this book we will explore 
the broad uses of Bayesian methods for a variety of inferential and statistical 
tasks. We begin in this chapter with an introduction to the basic ingredients 
of Bayesian learning, followed by some examples of the different ways in which 
Bayesian methods are used in practice. 


Bayesian learning 


Statistical induction is the process of learning about the general characteristics 
of a population from a subset of members of that population. Numerical values 
of population characteristics are typically expressed in terms of a parameter θ,
and numerical descriptions of the subset make up a dataset y. Before a dataset
is obtained, the numerical values of both the population characteristics and the
dataset are uncertain. After a dataset y is obtained, the information it contains 
can be used to decrease our uncertainty about the population characteristics. 
Quantifying this change in uncertainty is the purpose of Bayesian inference. 

The sample space 𝒴 is the set of all possible datasets, from which a single
dataset y will result. The parameter space Θ is the set of possible parameter
values, from which we hope to identify the value that best represents the true
population characteristics. The idealized form of Bayesian learning begins with
a numerical formulation of joint beliefs about y and θ, expressed in terms of
probability distributions over 𝒴 and Θ.


1. For each numerical value θ ∈ Θ, our prior distribution p(θ) describes our
belief that θ represents the true population characteristics.

2. For each θ ∈ Θ and y ∈ 𝒴, our sampling model p(y|θ) describes our belief
that y would be the outcome of our study if we knew θ to be true.


Once we obtain the data y, the last step is to update our beliefs about θ:


3. For each numerical value of θ ∈ Θ, our posterior distribution p(θ|y) describes
our belief that θ is the true value, having observed dataset y.


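The posterior distribution is obtained from the prior distribution and sampling
model via Bayes’ rule:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int_{\Theta} p(y \mid \tilde\theta)\, p(\tilde\theta)\, d\tilde\theta}.$$

It is important to note that Bayes’ rule does not tell us what our beliefs should
be; it tells us how they should change after seeing new information.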
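
The updating step can be illustrated on a finite grid of parameter values, where the integral in the denominator of Bayes’ rule becomes a sum. The following is a minimal Python sketch and not part of the original text; the three candidate values of θ, the equal prior weights and the data (zero successes in 20 independent trials) are hypothetical choices made only for illustration.

```python
def posterior_on_grid(prior, likelihood):
    """Apply Bayes' rule on a finite grid of parameter values.

    prior[i]      -- prior probability of the i-th candidate value of theta
    likelihood[i] -- p(y | theta_i) evaluated at the observed data y
    """
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)  # p(y), the normalizing constant
    return [j / evidence for j in joint]


# Hypothetical setup: three candidate values with equal prior weight,
# and y = 0 successes observed in 20 independent trials.
thetas = [0.05, 0.10, 0.20]
prior = [1 / 3, 1 / 3, 1 / 3]
likelihood = [(1 - t) ** 20 for t in thetas]
posterior = posterior_on_grid(prior, likelihood)
```

The prior does not prescribe the posterior by itself; the posterior weights only shift toward the values of θ under which the observed data were more likely.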


1.2 Why Bayes? 


Mathematical results of Cox (1946, 1961) and Savage (1954, 1972) prove that 
if p(θ) and p(y|θ) represent a rational person’s beliefs, then Bayes’ rule is an
optimal method of updating this person’s beliefs about θ given new information
y. These results give a strong theoretical justification for the use of
Bayes’ rule as a method of quantitative learning. However, in practical data
analysis situations it can be hard to formulate our prior beliefs with
mathematical precision, and so p(θ) is often chosen in a somewhat ad hoc manner
or for reasons of computational convenience. What then is the justification of 
Bayesian data analysis? 

A famous quote about sampling models is that “all models are wrong, 
but some are useful” (Box and Draper, 1987, pg. 424). Similarly, p(θ) might
be viewed as “wrong” if it does not accurately represent our prior beliefs.
However, this does not mean that p(θ|y) is not useful. If p(θ) approximates our
beliefs, then the fact that p(θ|y) is optimal under p(θ) means that it will also
generally serve as a good approximation to what our posterior beliefs should
be. In other situations it may not be our beliefs that are of interest. Rather, 
we may want to use Bayes’ rule to explore how the data would update the 
beliefs of a variety of people with differing prior opinions. Of particular interest 
might be the posterior beliefs of someone with weak prior information. This 
has motivated the use of “diffuse” prior distributions, which assign probability 
more or less evenly over large regions of the parameter space. 

Finally, in many complicated statistical problems there are no obvious 
non-Bayesian methods of estimation or inference. In these situations, Bayes’ 
rule can be used to generate estimation procedures, and the performance of 
these procedures can be evaluated using non-Bayesian criteria. In many cases 
it has been shown that Bayesian or approximately Bayesian procedures work 
very well, even for non-Bayesian purposes. 

The next two examples are intended to show how Bayesian inference, us- 
ing prior distributions that may only roughly represent our or someone else’s 
prior beliefs, can be broadly useful for statistical inference. Most of the math- 
ematical details of the calculations are left for later chapters. 


1.2.1 Estimating the probability of a rare event 


Suppose we are interested in the prevalence of an infectious disease in a small 
city. The higher the prevalence, the more public health precautions we would 
recommend be put into place. A small random sample of 20 individuals from 
the city will be checked for infection. 


Parameter and sample spaces 


Interest is in θ, the fraction of infected individuals in the city. Roughly
speaking, the parameter space includes all numbers between zero and one. The
data y records the total number of people in the sample who are infected. 
The parameter and sample spaces are then as follows: 


$$\Theta = [0, 1], \qquad \mathcal{Y} = \{0, 1, \ldots, 20\}.$$


Sampling model


Before the sample is obtained the number of infected individuals in the sample 
is unknown. We let the variable Y denote this to-be-determined value. If 
the value of θ were known, a reasonable sampling model for Y would be a
binomial(20, θ) probability distribution:


$$Y \mid \theta \sim \text{binomial}(20, \theta).$$


The first panel of Figure 1.1 plots the binomial(20, θ) distribution for θ equal
to 0.05, 0.10 and 0.20. If, for example, the true infection rate is 0.05, then the 
probability that there will be zero infected individuals in the sample (Y = 0) 
is 36%. If the true rate is 0.10 or 0.20, then the probabilities that Y = 0 are 
12% and 1%, respectively. 
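
These probabilities can be checked directly from the binomial probability mass function. The snippet below is a small Python sketch added for illustration (the text's own examples use R); the function name `binomial_pmf` is an assumption of this sketch.

```python
from math import comb


def binomial_pmf(y, n, theta):
    """Pr(Y = y) when Y | theta ~ binomial(n, theta)."""
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)


n = 20
# Probability of observing zero infected individuals for each infection rate
prob_zero = {theta: binomial_pmf(0, n, theta) for theta in (0.05, 0.10, 0.20)}
```

Rounded to two decimals, the three values in `prob_zero` agree with the 36%, 12% and 1% quoted above.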

Fig. 1.1. Sampling model, prior and posterior distributions for the infection rate
example. The plot on the left-hand side gives binomial(20, θ) distributions for three
values of θ. The right-hand side gives prior (gray) and posterior (black) densities of
θ.


Prior distribution 


Other studies from various parts of the country indicate that the infection rate 
in comparable cities ranges from about 0.05 to 0.20, with an average prevalence 
of 0.10. This prior information suggests that we use a prior distribution p(θ)
that assigns a substantial amount of probability to the interval (0.05, 0.20)
and under which the expected value of θ is close to 0.10. However, there are
infinitely many probability distributions that satisfy these conditions, and it 
is not clear that we can discriminate among them with our limited amount of 
prior information. We will therefore use a prior distribution p(θ) that has the
characteristics described above, but whose particular mathematical form is 
chosen for reasons of computational convenience. Specifically, we will encode 
the prior information using a member of the family of beta distributions. A 
beta distribution has two parameters which we denote as a and b. If θ has a
beta(a,b) distribution, then the expectation of θ is a/(a + b) and the most
probable value of θ is (a − 1)/(a − 1 + b − 1). For our problem where θ is
the infection rate, we will represent our prior information about θ with a
beta(2,20) probability distribution. Symbolically, we write


$$\theta \sim \text{beta}(2, 20).$$


This distribution is shown in the gray line in the second panel of Figure 1.1. 
The expected value of θ for this prior distribution is 0.09. The curve of the
prior distribution is highest at θ = 0.05 and about two-thirds of the area
under the curve occurs between 0.05 and 0.20. The prior probability that the 
infection rate is below 0.10 is 64%. 

$$
\begin{aligned}
\text{E}[\theta] &= 0.09 \\
\text{mode}[\theta] &= 0.05 \\
\Pr(\theta < 0.10) &= 0.64 \\
\Pr(0.05 < \theta < 0.20) &= 0.66.
\end{aligned}
$$
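
The prior summaries above can be reproduced numerically. The following Python sketch is an illustration rather than the author's code: it evaluates the beta(2,20) density with `math.gamma` and approximates interval probabilities with a simple midpoint rule. The helper names and the number of integration steps are assumptions of the sketch.

```python
from math import gamma


def beta_pdf(theta, a, b):
    """Density of a beta(a, b) distribution at theta."""
    const = gamma(a + b) / (gamma(a) * gamma(b))
    return const * theta ** (a - 1) * (1 - theta) ** (b - 1)


def beta_prob(lower, upper, a, b, steps=20000):
    """Midpoint-rule approximation to Pr(lower < theta < upper)."""
    width = (upper - lower) / steps
    total = sum(beta_pdf(lower + (i + 0.5) * width, a, b) for i in range(steps))
    return total * width


a, b = 2, 20
prior_mean = a / (a + b)            # a / (a + b)
prior_mode = (a - 1) / (a + b - 2)  # (a - 1) / (a - 1 + b - 1)
p_below_10 = beta_prob(0.0, 0.10, a, b)
p_05_to_20 = beta_prob(0.05, 0.20, a, b)
```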


Posterior distribution 


As we will see in Chapter 3, if Y|θ ~ binomial(n, θ) and θ ~ beta(a, b),
then if we observe a numeric value y of Y, the posterior distribution is a
beta(a + y, b + n − y) distribution. Suppose that for our study a value of Y = 0
is observed, i.e. none of the sample individuals are infected. The posterior
distribution of θ is then a beta(2,40) distribution.


$$\{\theta \mid Y = 0\} \sim \text{beta}(2, 40)$$


The density of this distribution is given by the black line in the second panel 
of Figure 1.1. This density is further to the left than the prior distribution, 
and more peaked as well. It is to the left of p(θ) because the observation
that Y = 0 provides evidence of a low value of θ. It is more peaked than p(θ)
because it combines information from the data and the prior distribution, and
thus contains more information than in p(θ) alone. The peak of this curve is
at 0.025 and the posterior expectation of θ is 0.048. The posterior probability
that θ < 0.10 is 93%.


$$
\begin{aligned}
\text{E}[\theta \mid Y = 0] &= 0.048 \\
\text{mode}[\theta \mid Y = 0] &= 0.025 \\
\Pr(\theta < 0.10 \mid Y = 0) &= 0.93.
\end{aligned}
$$
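
The conjugate update and the posterior summaries can be sketched in a few lines of Python. This is an illustrative sketch, not the text's own code. The helper `beta_cdf_integer` is an assumption of the sketch: it relies on the standard identity that, for positive integers a and b, Pr(θ ≤ x) under beta(a, b) equals Pr(Z ≥ a) for Z ~ binomial(a + b − 1, x), so it applies only to integer parameters.

```python
from math import comb


def beta_binomial_update(a, b, y, n):
    """Posterior beta parameters after observing y successes in n trials."""
    return a + y, b + n - y


def beta_cdf_integer(x, a, b):
    """Pr(theta <= x) for theta ~ beta(a, b) with positive integer a and b."""
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(a, m + 1))


a_post, b_post = beta_binomial_update(2, 20, y=0, n=20)
post_mean = a_post / (a_post + b_post)
post_mode = (a_post - 1) / (a_post + b_post - 2)
p_below_10 = beta_cdf_integer(0.10, a_post, b_post)
```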


The posterior distribution p(θ|Y = 0) provides us with a model for learning
about the city-wide infection rate θ. From a theoretical perspective, a rational
individual whose prior beliefs about θ were represented by a beta(2,20)
distribution now has beliefs that are represented by a beta(2,40) distribution. 
As a practical matter, if we accept the beta(2,20) distribution as a reasonable 
measure of prior information, then we accept the beta(2,40) distribution as a 
reasonable measure of posterior information. 


Sensitivity analysis 


Suppose we are to discuss the results of the survey with a group of city health 
officials. A discussion of the implications of our study among a diverse group of 
people might benefit from a description of the posterior beliefs corresponding 
to a variety of prior distributions. Suppose we were to consider beliefs rep- 
resented by beta(a,b) distributions for values of (a,b) other than (2,20). As 
mentioned above, if θ ~ beta(a, b), then given Y = y the posterior distribution
of θ is beta(a + y, b + n − y). The posterior expectation is

$$
\text{E}[\theta \mid Y = y] = \frac{a + y}{a + b + n}
= \frac{n}{a + b + n}\,\frac{y}{n} + \frac{a + b}{a + b + n}\,\frac{a}{a + b}
= \frac{n}{w + n}\,\bar{y} + \frac{w}{w + n}\,\theta_0,
$$

where θ₀ = a/(a + b) is the prior expectation of θ and w = a + b. From
this formula we see that the posterior expectation is a weighted average of
the sample mean ȳ and the prior expectation θ₀. In terms of estimating θ,
θ₀ represents our prior guess at the true value of θ and w represents our
confidence in this guess, expressed on the same scale as the sample size. If
someone provides us with a prior guess θ₀ and a degree of confidence w, then
we can approximate their prior beliefs about θ with a beta distribution having
parameters a = wθ₀ and b = w(1 − θ₀). Their approximate posterior beliefs are
then represented with a beta(wθ₀ + y, w(1 − θ₀) + n − y) distribution. We can
compute such a posterior distribution for a wide range of θ₀ and w values to
perform a sensitivity analysis, an exploration of how posterior information is
affected by differences in prior opinion. Figure 1.2 explores the effects of θ₀ and
w on the posterior distribution via contour plots of two posterior quantities.
The first plot gives contours of the posterior expectation E[θ|Y = 0], and
the second gives the posterior probabilities Pr(θ < 0.10|Y = 0). This latter
plot may be of use if, for instance, the city officials would like to recommend
a vaccine to the general public unless they were reasonably sure that the
current infection rate was less than 0.10. The plot indicates, for example, that
people with weak prior beliefs (low values of w) or low prior expectations are
generally 90% or more certain that the infection rate is below 0.10. However,
a high degree of certainty (say 97.5%) is only achieved by people who already
thought the infection rate was lower than the average of the other cities.


Fig. 1.2. Posterior quantities under different beta prior distributions. The left- and
right-hand panels give contours of E[θ|Y = 0] and Pr(θ < 0.10|Y = 0), respectively,
for a range of prior expectations and levels of confidence.
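
A sensitivity analysis of the posterior expectation can be sketched as a loop over prior guesses and confidence levels. The Python code below is a hedged illustration: the grid of prior expectations and w values is a hypothetical choice, not the grid used for the figure, and the function name is an assumption of the sketch.

```python
def posterior_mean(theta0, w, y, n):
    """Posterior mean of theta under a beta(w*theta0, w*(1 - theta0)) prior.

    Written as the weighted average of the sample mean y/n and the
    prior expectation theta0, with weights n/(w + n) and w/(w + n).
    """
    return (n / (w + n)) * (y / n) + (w / (w + n)) * theta0


y, n = 0, 20  # the survey result: no infections among 20 people
grid = {
    (theta0, w): posterior_mean(theta0, w, y, n)
    for theta0 in (0.05, 0.10, 0.20, 0.40)  # hypothetical prior guesses
    for w in (1, 5, 10, 25)                 # hypothetical confidence levels
}
```

Each entry of `grid` equals (wθ₀ + y)/(w + n), the mean of the corresponding beta posterior, so weak priors (small w) give posterior means close to the sample mean of zero.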


Comparison to non-Bayesian methods 


A standard estimate of a population proportion θ is the sample mean ȳ =
y/n, the fraction of infected people in the sample. For our sample in which
y = 0 this of course gives an estimate of zero, and so by using ȳ we would be
estimating that zero people in the city are infected. If we were to report this
estimate to a group of doctors or health officials we would probably want to 
include the caveat that this estimate is subject to sampling uncertainty. One 
way to describe the sampling uncertainty of an estimate is with a confidence 
interval. A popular 95% confidence interval for a population proportion θ is
the Wald interval, given by


$$\bar{y} \pm 1.96 \sqrt{\bar{y}(1 - \bar{y})/n}.$$


This interval has correct asymptotic frequentist coverage, meaning that if n 
is large, then with probability approximately equal to 95%, Y will take on 
a value y such that the above interval contains θ. Unfortunately this does
not hold for small n: For an n of around 20 the probability that the interval
contains θ is only about 80% (Agresti and Coull, 1998). Regardless, for our
sample in which y = 0 the Wald confidence interval comes out to be just a 
single point: zero. In fact, the 99.99% Wald interval also comes out to be zero. 
Certainly we would not want to conclude from the survey that we are 99.99% 
certain that no one in the city is infected. 
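
The degenerate behavior of the Wald interval at y = 0 is easy to reproduce. The following Python sketch is for illustration only; the multiplier 3.89 is an approximate two-sided 99.99% normal quantile supplied here as an assumption, not a value taken from the text.

```python
from math import sqrt


def wald_interval(y, n, z=1.96):
    """Wald interval: ybar +/- z * sqrt(ybar * (1 - ybar) / n)."""
    ybar = y / n
    half_width = z * sqrt(ybar * (1 - ybar) / n)
    return ybar - half_width, ybar + half_width


interval_95 = wald_interval(0, 20)            # 95% interval when y = 0
interval_9999 = wald_interval(0, 20, z=3.89)  # approximately 99.99%
```

Because the estimated standard error is zero whenever ȳ = 0, the half-width vanishes for any multiplier z, and both intervals collapse to the single point zero.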

People have suggested a variety of alternatives to the Wald interval in 
hopes of avoiding this type of behavior. One type of confidence interval that 
performs well by non-Bayesian criteria is the “adjusted” Wald interval sug- 
gested by Agresti and Coull (1998), which is given by 


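$$\hat\theta \pm 1.96 \sqrt{\hat\theta (1 - \hat\theta)/n}, \quad \text{where} \quad \hat\theta = \frac{n}{n+4}\,\bar{y} + \frac{4}{n+4}\cdot\frac{1}{2}.$$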


While not originally motivated as such, this interval is clearly related to 
Bayesian inference: The value of θ̂ here is equivalent to the posterior mean for
θ under a beta(2,2) prior distribution, which represents weak prior information
centered around θ = 1/2.
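
The link between the adjusted interval and the beta(2,2) posterior mean can be verified numerically. The Python sketch below follows the interval formula as it is written in this section (with n under the square root); the function names are assumptions of the sketch.

```python
from math import sqrt


def adjusted_wald_interval(y, n, z=1.96):
    """Center and endpoints of the adjusted Wald interval as given in the text."""
    theta_hat = (n / (n + 4)) * (y / n) + (4 / (n + 4)) * 0.5
    half_width = z * sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat, (theta_hat - half_width, theta_hat + half_width)


def beta_posterior_mean(a, b, y, n):
    """Posterior mean of theta under a beta(a, b) prior and binomial data."""
    return (a + y) / (a + b + n)


theta_hat, interval = adjusted_wald_interval(0, 20)
```

For y = 0 and n = 20 the center is 2/24 rather than zero, so the interval no longer collapses to a point.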


General estimation of a population mean 


Given a random sample of n observations from a population, a standard
estimate of the population mean θ is the sample mean ȳ. While ȳ is generally
a reliable estimate for large sample sizes, as we saw in the example it can be
statistically unreliable for small n, in which case it serves more as a summary
of the sample data than as a precise estimate of θ.

If our interest lies more in obtaining an estimate of θ than in summarizing
our sample data, we may want to consider estimators of the form


$$\hat\theta = \frac{n}{n + w}\,\bar{y} + \frac{w}{n + w}\,\theta_0,$$


where θ₀ represents a “best guess” at the true value of θ and w represents a
degree of confidence in the guess. If the sample size is large, then ȳ is a reliable
estimate of θ. The estimator θ̂ takes advantage of this by having its weights
on ȳ and θ₀ go to one and zero, respectively, as n increases. As a result, the
statistical properties of ȳ and θ̂ are essentially the same for large n. However,
for small n the variability of ȳ might be more than our uncertainty about θ₀.
In this case, using θ̂ allows us to combine the data with prior information to
stabilize our estimation of θ.

These properties of θ̂ for both large and small n suggest that it is a useful
estimate of θ for a broad range of n. In Section 5.4 we will confirm this by
showing that, under some conditions, θ̂ outperforms ȳ as an estimator of θ for
all values of n. As we saw in the infection rate example and will see again in
later chapters, θ̂ can be interpreted as a Bayesian estimator using a certain
class of prior distributions. Even if a particular prior distribution p(θ) does not
exactly reflect our prior information, the corresponding posterior distribution
p(θ|y) can still be a useful means of providing stable inference and estimation
for situations in which the sample size is low. 
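
The trade-off between the two estimators can be made concrete with their mean squared errors. The Python sketch below is an illustration under stated assumptions: the observations are taken to be independent with mean θ and variance σ², so that ȳ has mean squared error σ²/n and θ̂ has variance (n/(n+w))²σ²/n plus squared bias (w/(n+w))²(θ₀ − θ)². The numerical values of θ, θ₀, w and σ² are hypothetical.

```python
def mse_sample_mean(sigma2, n):
    """Mean squared error of the sample mean: variance only, no bias."""
    return sigma2 / n


def mse_shrinkage(theta, theta0, w, sigma2, n):
    """Mean squared error of n/(n+w) * ybar + w/(n+w) * theta0."""
    weight = n / (n + w)
    variance = weight**2 * sigma2 / n
    bias = (1 - weight) * (theta0 - theta)
    return variance + bias**2


# Hypothetical scenario: the prior guess is close to, but not equal to, the truth.
theta, theta0, w, sigma2 = 0.10, 0.12, 5, 0.09
comparison = {
    n: (mse_sample_mean(sigma2, n), mse_shrinkage(theta, theta0, w, sigma2, n))
    for n in (5, 20, 100, 1000)
}
```

In this scenario the second entry of each pair is smaller than the first for every n shown, and the two become nearly equal as n grows. A prior guess far from the truth would reverse the comparison for some n.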


1.2.2 Building a predictive model 


In Chapter 9 we will discuss an example in which our task is to build a pre- 
dictive model of diabetes progression as a function of 64 baseline explanatory 
variables such as age, sex and body mass index. Here we give a brief synopsis of 
that example. We will first estimate the parameters in a regression model us- 
ing a “training” dataset consisting of measurements from 342 patients. We will 
then evaluate the predictive performance of the estimated regression model 
using a separate “test” dataset of 100 patients. 


Sampling model and parameter space 


Letting Y_i be the diabetes progression of subject i and x_i = (x_{i,1}, ..., x_{i,64})
be the explanatory variables, we will consider linear regression models of the
form

$$Y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_{64} x_{i,64} + \sigma \epsilon_i.$$


The sixty-five unknown parameters in this model are the vector of regression 
coefficients β = (β₁, ..., β₆₄) as well as σ, the standard deviation of the error
term. The parameter space is 64-dimensional Euclidean space for β and the
positive real line for σ.


Prior distribution


In most situations, defining a joint prior probability distribution for 65 pa- 
rameters that accurately represents prior beliefs is a near-impossible task. As 
an alternative, we will use a prior distribution that only represents some as- 
pects of our prior beliefs. The main belief that we would like to represent is 
that most of the 64 explanatory variables have little to no effect on diabetes 
progression, i.e. most of the regression coefficients are zero. In Chapter 9 we 
will discuss a prior distribution on β that roughly represents this belief, in
that each regression coefficient has a 50% prior probability of being equal to 
zero. 


Posterior distribution 


Given data y = (y_1, ..., y_342) and X = (x_1, ..., x_342), the posterior distribution
p(β|y, X) can be computed and used to obtain Pr(β_j ≠ 0|y, X) for each
regression coefficient j. These probabilities are plotted in the first panel of
Figure 1.3. Even though each of the sixty-four coefficients started out with a
50-50 chance of being non-zero in the prior distribution, there are only six β_j’s
for which Pr(β_j ≠ 0|y, X) > 0.5. The vast majority of the remaining coefficients
have high posterior probabilities of being zero. This dramatic increase
in the expected number of zero coefficients is a result of the information in the 
data, although it is the prior distribution that allows for such zero coefficients 
in the first place. 

Fig. 1.3. Posterior probabilities that each coefficient is non-zero.


Predictive performance and comparison to non-Bayesian methods


We can evaluate how well this model performs by using it to predict the test 
data: Let β̂_Bayes = E[β|y, X] be the posterior expectation of β, and let X_test
be the 100 × 64 matrix giving the data for the 100 patients in the test dataset.
We can compute a predicted value for each of the 100 observations in the test
set using the equation ŷ_test = X_test β̂_Bayes. These predicted values can then be
compared to the actual observations y_test. A plot of ŷ_test versus y_test appears
in the first panel of Figure 1.4, and indicates how well β̂_Bayes is able to predict
diabetes progression from the baseline variables. 

How does this Bayesian estimate of β compare to a non-Bayesian approach?
The most commonly used estimate of a vector of regression coefficients
is the ordinary least squares (OLS) estimate, provided in most if not all
statistical software packages. The OLS regression estimate is the value β̂_ols of
β that minimizes the sum of squares of the residuals (SSR) for the observed
data,

$$\text{SSR}(\beta) = \sum_{i=1}^{n} (y_i - \beta^T x_i)^2,$$


and is given by the formula β̂_ols = (XᵀX)⁻¹Xᵀy. Predictions for the test
data based on this estimate are given by X_test β̂_ols and are plotted against the
observed values in the second panel of Figure 1.4. Notice that using β̂_ols
gives a weaker relationship between observed and predicted values than using
β̂_Bayes. This can be quantified numerically by computing the average squared
prediction error, Σ_i (y_test,i − ŷ_test,i)²/100, for both sets of predictions. The
prediction error for OLS is 0.67, about 50% higher than the value of 0.45 we
obtain using the Bayesian estimate. In this problem, even though our ad hoc
prior distribution for β only captures the basic structure of our prior beliefs
(namely, that many of the coefficients are likely to be zero), this is enough to 
provide a large improvement in predictive performance over the OLS estimate. 
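
The OLS formula and the average squared prediction error can be sketched on a toy problem. The Python code below is illustrative only: it solves the normal equations for a model with two predictors and no intercept, and the training and test values are made up. It is not the diabetes dataset and does not reproduce the 0.67 and 0.45 figures.

```python
def ols_two_predictors(X, y):
    """Solve the normal equations (X^T X) beta = X^T y for a two-column X."""
    s11 = sum(x1 * x1 for x1, _ in X)
    s12 = sum(x1 * x2 for x1, x2 in X)
    s22 = sum(x2 * x2 for _, x2 in X)
    t1 = sum(x1 * yi for (x1, _), yi in zip(X, y))
    t2 = sum(x2 * yi for (_, x2), yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


def predict(X, beta):
    """Fitted or predicted values X beta."""
    return [beta[0] * x1 + beta[1] * x2 for x1, x2 in X]


def avg_squared_error(y, y_hat):
    """Average squared prediction error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


# Made-up training and test data (hypothetical values)
X_train = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y_train = [2.1, -0.9, 1.0, 3.1]
X_test = [(1.0, 2.0), (3.0, 1.0)]
y_test = [0.2, 4.8]

beta_ols = ols_two_predictors(X_train, y_train)
test_error = avg_squared_error(y_test, predict(X_test, beta_ols))
```
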
The poor performance of the OLS method is due to its inability to recog- 
nize when the sample size is too small to accurately estimate the regression 
coefficients. In such situations, the linear relationship between the values of 
y and X in the dataset, quantified by β̂_ols, is often an inaccurate representation
of the relationship in the entire population. The standard remedy to
this problem is to fit a “sparse” regression model, in which some or many 
of the regression coefficients are set to zero. One method of choosing which 
coefficients to set to zero is the Bayesian approach described above. Another 
popular method is the “lasso,” introduced by Tibshirani (1996) and studied 
extensively by many others. The lasso estimate is the value β̂_lasso of β that
minimizes SSR(β : λ), a modified version of the sum of squared residuals:

$$\text{SSR}(\beta : \lambda) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$




Fig. 1.4. Observed versus predicted diabetes progression values using the Bayes 
estimate (left panel) and the OLS estimate (right panel). 


In other words, the lasso procedure penalizes large values of |β_j|. Depending
on the size of λ, this penalty can make some elements of β̂_lasso equal to
zero. Although the lasso procedure has been motivated by and studied in
a non-Bayesian context, in fact it corresponds to a Bayesian estimate using
a particular prior distribution: The lasso estimate is equal to the posterior
mode of β in which the prior distribution for each β_j is a double-exponential
distribution, a probability distribution that has a sharp peak at β_j = 0.
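
How the penalty produces exact zeros can be seen in the simplest case of a single predictor, where the minimizer of the penalized sum of squares has a closed form. The Python sketch below is an assumption-laden illustration rather than a general lasso solver: with one predictor, setting the derivative of Σ(y_i − x_i β)² + λ|β| to zero gives β = sign(s) · max(|s| − λ/2, 0) / Σx_i², where s = Σx_i y_i. The data values are made up.

```python
def lasso_one_predictor(x, y, lam):
    """Minimize sum_i (y_i - x_i * beta)^2 + lam * |beta| over a scalar beta."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    shrunk = max(abs(sxy) - lam / 2, 0.0)  # soft-thresholding step
    return (shrunk if sxy >= 0 else -shrunk) / sxx


def penalized_ssr(beta, x, y, lam):
    """The penalized criterion for a single coefficient."""
    return sum((yi - xi * beta) ** 2 for xi, yi in zip(x, y)) + lam * abs(beta)


# Made-up data with a weak relationship between x and y
x = [1.0, 2.0, 3.0]
y = [0.3, 0.1, 0.5]
beta_small_penalty = lasso_one_predictor(x, y, lam=1.0)
beta_large_penalty = lasso_one_predictor(x, y, lam=5.0)
```

With the smaller penalty the coefficient is shrunk toward zero but stays positive; with the larger penalty it is set exactly to zero.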


1.3 Where we are going 


As the above examples indicate, the uses of Bayesian methods are quite broad. 
We have seen how the Bayesian approach provides 


• models for rational, quantitative learning;
• estimators that work for small and large sample sizes;
• methods for generating statistical procedures in complicated problems.


An understanding of the benefits and limits of Bayesian methods comes with 
experience. In the chapters that follow, we will become familiar with these 
methods by applying them to a large number of statistical models and data 
analysis examples. After a review of probability in Chapter 2, we will learn 
the basics of Bayesian data analysis and computation in the context of some 
simple one-parameter statistical models in Chapters 3 and 4. Chapters 5, 6 
and 7 discuss Bayesian inference with the normal and multivariate normal 
models. While important in their own right, normal models also provide the 
building blocks of more complicated modern statistical methods, such as
hierarchical modeling, regression, variable selection and mixed effects models.
These advanced topics and others are covered in Chapters 8 through 12. 


1.4 Discussion and further references 


The idea of probability as a measure of uncertainty about unknown but de- 
terministic quantities is an old one. Important historical works include Bayes’ 
“An essay towards solving a Problem in the Doctrine of Chances” (Bayes, 
1763) and Laplace’s “A Philosophical Essay on Probabilities,” published in 
1814 and currently published by Dover (Laplace, 1995). 

The role of prior opinion in statistical inference was debated for much of 
the 20th century. Most published articles on this debate take up one side or an- 
other, and include mischaracterizations of the other side. More informative are 
discussions among statisticians of different viewpoints: Savage (1962) includes 
a short introduction by Savage, followed by a discussion among Bartlett, 
Barnard, Cox, Pearson and Smith, among others. Little (2006) considers the 
strengths and weaknesses of Bayesian and frequentist statistical criteria. Efron 
(2005) briefly discusses the role of different statistical philosophies in the last 
two centuries, and speculates on the interplay between Bayesian and non- 
Bayesian methods in the future of statistical science. 


2 


Belief, probability and exchangeability 


We first discuss what properties a reasonable belief function should have, and 
show that probabilities have these properties. Then, we review the basic ma- 
chinery of discrete and continuous random variables and probability distribu- 
tions. Finally, we explore the link between independence and exchangeability. 


2.1 Belief functions and probabilities 


At the beginning of the last chapter we claimed that probabilities are a way 
to numerically express rational beliefs. We do not prove this claim here (see 
Chapter 2 of Jaynes (2003) or Chapters 2 and 3 of Savage (1972) for details), 
but we do show that several properties we would want our numerical beliefs 
to have are also properties of probabilities. 


Belief functions 


Let F, G, and H be three possibly overlapping statements about the world. 
For example: 


F = { a person votes for a left-of-center candidate } 
G = { a person’s income is in the lowest 10% of the population } 
H = { a person lives in a large city } 


Let Be() be a belief function, that is, a function that assigns numbers to 
statements such that the larger the number, the higher the degree of belief. 
Some philosophers have tried to make this more concrete by relating beliefs 
to preferences over bets: 


• Be(F) > Be(G) means we would prefer to bet F is true than G is true.

We also want Be() to describe our beliefs under certain conditions:

• Be(F|H) > Be(G|H) means that if we knew that H were true, then we
would prefer to bet that F is also true than bet G is also true.


P.D. Hoff, A First Course in Bayesian Statistical Methods, 
Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_2, 
© Springer Science+Business Media, LLC 2009 
• Be(F|G) > Be(F|H) means that if we were forced to bet on F, we would
prefer to do it under the condition that G is true rather than H is true.


Axioms of beliefs 


It has been argued by many that any function that is to numerically represent 
our beliefs should have the following properties: 


B1 Be(not H|H) ≤ Be(F|H) ≤ Be(H|H)
B2 Be(F or G|H) ≥ max{Be(F|H), Be(G|H)}
B3 Be(F and G|H) can be derived from Be(G|H) and Be(F|G and H)


How should we interpret these properties? Are they reasonable? 


B1 says that the number we assign to Be(F|H), our conditional belief in F
given H, is bounded below and above by the numbers we assign to complete
disbelief (Be(not H|H)) and complete belief (Be(H|H)).


B2 says that our belief that the truth lies in a given set of possibilities should 
not decrease as we add to the set of possibilities. 


B3 is a bit trickier. To see why it makes sense, imagine you have to decide 
whether or not F and G are true, knowing that H is true. You could do this 
by first deciding whether or not G is true given H, and if so, then deciding 
whether or not F is true given G and H. 


Axioms of probability 


Now let’s compare B1, B2 and B3 to the standard axioms of probability. 
Recall that F ∪ G means “F or G,” F ∩ G means “F and G” and ∅ is the
empty set.

P1 0 = Pr(not H|H) ≤ Pr(F|H) ≤ Pr(H|H) = 1

P2 Pr(F ∪ G|H) = Pr(F|H) + Pr(G|H) if F ∩ G = ∅

P3 Pr(F ∩ G|H) = Pr(G|H) Pr(F|G ∩ H)

You should convince yourself that a probability function, satisfying P1, P2 
and P3, also satisfies B1, B2 and B3. Therefore if we use a probability 
function to describe our beliefs, we have satisfied the axioms of belief. 
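
One way to build intuition for this claim is to check the properties on a small finite example. The Python sketch below is illustrative and assumes a toy world of twelve equally likely outcomes; the particular sets F, G and H are hypothetical. A numerical check of one example is not a proof, but it shows what each property asserts.

```python
from fractions import Fraction

# A toy world with twelve equally likely outcomes (hypothetical)
OUTCOMES = frozenset(range(12))


def pr(event, given=OUTCOMES):
    """Conditional probability Pr(event | given) under equally likely outcomes."""
    return Fraction(len(event & given), len(given))


F = frozenset({0, 1, 2, 3})
G = frozenset({3, 4, 5})
H = frozenset(range(8))

# B1: beliefs are bounded by complete disbelief and complete belief
b1_holds = pr(OUTCOMES - H, H) <= pr(F, H) <= pr(H, H)
# B2: belief does not decrease when the set of possibilities grows
b2_holds = pr(F | G, H) >= max(pr(F, H), pr(G, H))
# P2: additivity for disjoint events (F - G and G are disjoint)
p2_holds = pr((F - G) | G, H) == pr(F - G, H) + pr(G, H)
# P3 (and hence B3): the joint belief is derived from two conditional beliefs
p3_holds = pr(F & G, H) == pr(G, H) * pr(F, G & H)
```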


2.2 Events, partitions and Bayes’ rule 


Definition 1 (Partition) A collection of sets {H_1, ..., H_K} is a partition
of another set H if


1. the events are disjoint, which we write as H_i ∩ H_j = ∅ for i ≠ j;
2. the union of the sets is H, which we write as ∪_{k=1}^{K} H_k = H.


In the context of identifying which of several statements is true, if H is the 
set of all possible truths and {H1,...,H«} is a partition of H, then exactly 
one out of {H;,..., Hg} contains the truth. 
Examples


• Let H be someone’s religious orientation. Partitions include
  – {Protestant, Catholic, Jewish, other, none};
  – {Christian, non-Christian};
  – {atheist, monotheist, multitheist}.
• Let H be someone’s number of children. Partitions include
  – {0, 1, 2, 3 or more};
  – {0, 1, 2, 3, 4, 5, 6, ...}.
• Let H be the relationship between smoking and hypertension in a given
  population. Partitions include
  – {some relationship, no relationship};
  – {negative correlation, zero correlation, positive correlation}.


Partitions and probability 
Suppose {H_1, ..., H_K} is a partition of H, Pr(H) = 1, and E is some specific
event. The axioms of probability imply the following:


Rule of total probability:
$$\sum_{k=1}^{K} \Pr(H_k) = 1$$

Rule of marginal probability:
$$\Pr(E) = \sum_{k=1}^{K} \Pr(E \cap H_k)$$

Bayes’ rule:
$$\Pr(H_j \mid E) = \frac{\Pr(E \mid H_j)\Pr(H_j)}{\Pr(E)}
  = \frac{\Pr(E \mid H_j)\Pr(H_j)}{\sum_{k=1}^{K} \Pr(E \mid H_k)\Pr(H_k)}$$


Example 


A subset of the 1996 General Social Survey includes data on the education level
and income for a sample of males over 30 years of age. Let {H_1, H_2, H_3, H_4} be
the events that a randomly selected person in this sample is in, respectively,
the lower 25th percentile, the second 25th percentile, the third 25th percentile
and the upper 25th percentile in terms of income. By definition,


{Pr(H_1), Pr(H_2), Pr(H_3), Pr(H_4)} = {.25, .25, .25, .25}.

Note that {H_1, H_2, H_3, H_4} is a partition and so these probabilities sum to 1.
Let E be the event that a randomly sampled person from the survey has a
college education. From the survey data, we have


{Pr(E|H_1), Pr(E|H_2), Pr(E|H_3), Pr(E|H_4)} = {.11, .19, .31, .53}.


These probabilities do not sum to 1; they represent the proportions of people
with college degrees in the four different income subpopulations H_1, H_2, H_3
and H_4. Now let’s consider the income distribution of the college-educated
population. Using Bayes’ rule we can obtain


{Pr(H_1|E), Pr(H_2|E), Pr(H_3|E), Pr(H_4|E)} = {.09, .17, .27, .47},


and we see that the income distribution for people in the college-educated
population differs markedly from {.25, .25, .25, .25}, the distribution for the
general population. Note that these probabilities do sum to 1; they are the
conditional probabilities of the events in the partition, given E.
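
The Bayes’ rule calculation in this example can be sketched in a few lines of Python. This is only an illustration: the variable names are ours, and the inputs are the rounded values quoted above, so the output should agree with the reported posterior probabilities only to within about .01.

```python
# Bayes' rule over a partition, using the rounded values quoted in the text.
prior = [0.25, 0.25, 0.25, 0.25]        # Pr(H_k): the four income quartiles
likelihood = [0.11, 0.19, 0.31, 0.53]   # Pr(E | H_k): college education rates

# Rule of marginal probability: Pr(E) = sum_k Pr(E | H_k) Pr(H_k)
pr_e = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' rule: Pr(H_k | E) = Pr(E | H_k) Pr(H_k) / Pr(E)
posterior = [l * p / pr_e for l, p in zip(likelihood, prior)]

print("Pr(E) =", round(pr_e, 3))
print("Pr(H_k | E) =", [round(x, 3) for x in posterior])
```

Because the prior is the same for every quartile here, the posterior is simply the likelihood vector rescaled to sum to 1.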


In Bayesian inference, {H_1, ..., H_K} often refer to disjoint hypotheses or
states of nature and E refers to the outcome of a survey, study or experiment.
To compare hypotheses post-experimentally, we often calculate the following
ratio:

$$\begin{aligned}
\frac{\Pr(H_i \mid E)}{\Pr(H_j \mid E)}
  &= \frac{\Pr(E \mid H_i)\Pr(H_i)/\Pr(E)}{\Pr(E \mid H_j)\Pr(H_j)/\Pr(E)} \\
  &= \frac{\Pr(E \mid H_i)\Pr(H_i)}{\Pr(E \mid H_j)\Pr(H_j)} \\
  &= \frac{\Pr(E \mid H_i)}{\Pr(E \mid H_j)} \times \frac{\Pr(H_i)}{\Pr(H_j)} \\
  &= \text{“Bayes factor”} \times \text{“prior beliefs”} .
\end{aligned}$$


This calculation reminds us that Bayes’ rule does not determine what our 
beliefs should be after seeing the data, it only tells us how they should change 
after seeing the data. 
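
As a small illustration of this decomposition, the sketch below computes the Bayes factor, the prior odds and the resulting posterior odds for two hypotheses. The function name `posterior_odds` is hypothetical, and the numbers reuse the rounded income and education values from the previous example (top quartile H_4 versus bottom quartile H_1).

```python
def posterior_odds(lik_i, lik_j, prior_i, prior_j):
    """Return (Bayes factor, prior odds, posterior odds) for H_i versus H_j."""
    bayes_factor = lik_i / lik_j      # Pr(E | H_i) / Pr(E | H_j)
    prior_odds = prior_i / prior_j    # Pr(H_i) / Pr(H_j)
    return bayes_factor, prior_odds, bayes_factor * prior_odds

# H_4 versus H_1, given E = {college education}.
bayes_factor, prior_odds, odds = posterior_odds(0.53, 0.11, 0.25, 0.25)
print(bayes_factor, prior_odds, odds)
```

With equal prior probabilities the prior odds are 1, so the posterior odds equal the Bayes factor; a different prior would rescale the odds without changing the Bayes factor.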


Example 


Suppose we are interested in the rate of support for a particular candidate for 
public office. Let 


H = { all possible rates of support for candidate A };

H_1 = { more than half the voters support candidate A };

H_2 = { less than or equal to half the voters support candidate A };

E = { 54 out of 100 people surveyed said they support candidate A }.


Then {H_1, H_2} is a partition of H. Of interest is Pr(H_1|E), or Pr(H_1|E)/Pr(H_2|E).
We will learn how to obtain these quantities in the next chapter.
2.3 Independence


Definition 2 (Independence) Two events F and G are conditionally independent
given H if Pr(F ∩ G|H) = Pr(F|H) Pr(G|H).


How do we interpret conditional independence? By Axiom P3, the following
is always true:

$$\Pr(F \cap G \mid H) = \Pr(G \mid H)\Pr(F \mid H \cap G).$$

If F and G are conditionally independent given H, then we must have

$$\begin{aligned}
\Pr(G \mid H)\Pr(F \mid H \cap G)
  &\overset{\text{always}}{=} \Pr(F \cap G \mid H)
   \overset{\text{independence}}{=} \Pr(F \mid H)\Pr(G \mid H) \\
\Pr(G \mid H)\Pr(F \mid H \cap G) &= \Pr(F \mid H)\Pr(G \mid H) \\
\Pr(F \mid H \cap G) &= \Pr(F \mid H).
\end{aligned}$$

Conditional independence therefore implies that Pr(F|H ∩ G) = Pr(F|H). In
other words, if we know H is true and F and G are conditionally independent
given H, then knowing G does not change our belief about F.


Examples 


Let’s consider the conditional dependence of F and G when H is assumed to 
be true in the following two situations: 


F = { a hospital patient is a smoker } 
G = { a hospital patient has lung cancer } 
H = { smoking causes lung cancer} 


F = { you are thinking of the jack of hearts } 
G = { a mind reader claims you are thinking of the jack of hearts } 
H = { the mind reader has extrasensory perception } 


In both of these situations, H being true implies a relationship between F 
and G. What about when H is not true? 


2.4 Random variables 


In Bayesian inference a random variable is defined as an unknown numeri- 
cal quantity about which we make probability statements. For example, the 
quantitative outcome of a survey, experiment or study is a random variable 
before the study is performed. Additionally, a fixed but unknown population 
parameter is also a random variable. 
2.4.1 Discrete random variables


Let Y be a random variable and let 𝒴 be the set of all possible values of Y.
We say that Y is discrete if the set of possible outcomes is countable, meaning
that 𝒴 can be expressed as 𝒴 = {y_1, y_2, ...}.


Examples 


• Y = number of churchgoers in a random sample from a population
• Y = number of children of a randomly sampled person
• Y = number of years of education of a randomly sampled person


Probability distributions and densities 


The event that the outcome Y of our survey has the value y is expressed as
{Y = y}. For each y ∈ 𝒴, our shorthand notation for Pr(Y = y) will be p(y).
This function of y is called the probability density function (pdf) of Y, and it
has the following properties:


1. 0 ≤ p(y) ≤ 1 for all y ∈ 𝒴;
2. Σ_{y∈𝒴} p(y) = 1.


General probability statements about Y can be derived from the pdf. For
example, Pr(Y ∈ A) = Σ_{y∈A} p(y). If A and B are disjoint subsets of 𝒴, then

$$\begin{aligned}
\Pr(Y \in A \text{ or } Y \in B) = \Pr(Y \in A \cup B)
  &= \Pr(Y \in A) + \Pr(Y \in B) \\
  &= \sum_{y \in A} p(y) + \sum_{y \in B} p(y).
\end{aligned}$$


Example: Binomial distribution 


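Let 𝒴 = {0, 1, 2, ..., n} for some positive integer n. The uncertain quantity
Y ∈ 𝒴 has a binomial distribution with probability θ if

$$\Pr(Y = y \mid \theta) = \texttt{dbinom}(y, n, \theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y}.$$

For example, if θ = .25 and n = 4, we have:

$$\begin{aligned}
\Pr(Y = 0 \mid \theta = .25) &= \binom{4}{0}(.25)^{0}(.75)^{4} = .316 \\
\Pr(Y = 1 \mid \theta = .25) &= \binom{4}{1}(.25)^{1}(.75)^{3} = .422 \\
\Pr(Y = 2 \mid \theta = .25) &= \binom{4}{2}(.25)^{2}(.75)^{2} = .211 \\
\Pr(Y = 3 \mid \theta = .25) &= \binom{4}{3}(.25)^{3}(.75)^{1} = .047 \\
\Pr(Y = 4 \mid \theta = .25) &= \binom{4}{4}(.25)^{4}(.75)^{0} = .004 .
\end{aligned}$$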
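
The five probabilities above can be reproduced directly from the formula. The sketch below is a plain Python stand-in for R’s dbinom(); the helper is our own small function, not a library routine.

```python
from math import comb

def dbinom(y, n, theta):
    """Binomial probability Pr(Y = y | theta) for y in {0, ..., n}."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Probabilities for n = 4 and theta = 0.25, as in the example.
probs = [dbinom(y, 4, 0.25) for y in range(5)]
for y, p in enumerate(probs):
    print(y, round(p, 3))
print("total:", sum(probs))
```

Summing over the whole sample space is a quick check that the function defines a proper probability distribution.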
Example: Poisson distribution


Let 𝒴 = {0, 1, 2, ...}. The uncertain quantity Y ∈ 𝒴 has a Poisson distribution
with mean θ if

$$\Pr(Y = y \mid \theta) = \texttt{dpois}(y, \theta) = \theta^{y} e^{-\theta} / y! .$$

For example, if θ = 2.1 (the 2006 U.S. fertility rate),

$$\begin{aligned}
\Pr(Y = 0 \mid \theta = 2.1) &= (2.1)^{0} e^{-2.1}/(0!) = .12 \\
\Pr(Y = 1 \mid \theta = 2.1) &= (2.1)^{1} e^{-2.1}/(1!) = .26 \\
\Pr(Y = 2 \mid \theta = 2.1) &= (2.1)^{2} e^{-2.1}/(2!) = .27 \\
\Pr(Y = 3 \mid \theta = 2.1) &= (2.1)^{3} e^{-2.1}/(3!) = .19
\end{aligned}$$
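
As with the binomial example, these values follow directly from the formula. The sketch below is a plain Python stand-in for R’s dpois(); the helper is our own, written only for illustration.

```python
from math import exp, factorial

def dpois(y, theta):
    """Poisson probability Pr(Y = y | theta) for y in {0, 1, 2, ...}."""
    return theta**y * exp(-theta) / factorial(y)

# First four probabilities for theta = 2.1, as in the example.
values = [dpois(y, 2.1) for y in range(4)]
for y, p in enumerate(values):
    print(y, round(p, 2))

# The sample space is infinite, but the probabilities still sum to 1;
# 50 terms are far more than enough at this mean.
total = sum(dpois(y, 2.1) for y in range(50))
print("total:", total)
```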
Fig. 2.1. Poisson distributions with means of 2.1 and 21.


2.4.2 Continuous random variables 


Suppose that the sample space 𝒴 is roughly equal to ℝ, the set of all real
numbers. We cannot define Pr(Y ≤ 5) as equal to Σ_{y≤5} p(y) because the
sum does not make sense (the set of real numbers less than or equal to 5 is
“uncountable”). So instead of defining probabilities of events in terms of a pdf
p(y), courses in mathematical statistics often define probability distributions
for random variables in terms of something called a cumulative distribution
function, or cdf:

$$F(y) = \Pr(Y \le y).$$

Note that F(∞) = 1, F(−∞) = 0, and F(b) ≤ F(a) if b < a. Probabilities of
various events can be derived from the cdf:

• Pr(Y > a) = 1 − F(a)
• Pr(a < Y ≤ b) = F(b) − F(a)


If F is continuous (i.e. lacking any “jumps”), we say that Y is a continuous
random variable. A theorem from mathematics says that for every continuous
cdf F there exists a positive function p(y) such that

$$F(a) = \int_{-\infty}^{a} p(y)\, dy .$$

This function is called the probability density function of Y, and its properties
are similar to those of a pdf for a discrete random variable:


1. 0 ≤ p(y) for all y ∈ 𝒴;
2. ∫_{y∈ℝ} p(y) dy = 1.


As in the discrete case, probability statements about Y can be derived from
the pdf: Pr(Y ∈ A) = ∫_{y∈A} p(y) dy, and if A and B are disjoint subsets of 𝒴,
then

$$\begin{aligned}
\Pr(Y \in A \text{ or } Y \in B) = \Pr(Y \in A \cup B)
  &= \Pr(Y \in A) + \Pr(Y \in B) \\
  &= \int_{y \in A} p(y)\, dy + \int_{y \in B} p(y)\, dy .
\end{aligned}$$


Comparing these properties to the analogous properties in the discrete case, 
we see that integration for continuous distributions behaves similarly to sum- 
mation for discrete distributions. In fact, integration can be thought of as a 
generalization of summation for situations in which the sample space is not 
countable. However, unlike a pdf in the discrete case, the pdf for a continuous 
random variable is not necessarily less than 1, and p(y) is not “the probability 
that Y = y.” However, if p(y_1) > p(y_2) we will sometimes informally say that
y_1 “has a higher probability” than y_2.


Example: Normal distribution 


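Suppose we are sampling from a population on 𝒴 = (−∞, ∞), and we know
that the mean of the population is μ and the variance is σ². Among all
probability distributions having a mean of μ and a variance of σ², the one that is
the most “spread out” or “diffuse” (in terms of a measure called entropy), is
the normal(μ, σ²) distribution, having a cdf given by

$$\Pr(Y \le y \mid \mu, \sigma^2) = F(y)
  = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\left\{ -\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^{2} \right\} dy .$$

Evidently,

$$p(y \mid \mu, \sigma^2) = \texttt{dnorm}(y, \mu, \sigma)
  = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\left\{ -\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^{2} \right\} .$$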




Letting μ = 10.75 and σ = .8 (σ² = .64) gives the cdf and density in Figure
2.2. This mean and standard deviation make the median value of e^Y equal
to about 46,630, which is about the median U.S. household income in 2005.
Additionally, Pr(e^Y > 100000) = Pr(Y > log 100000) = 0.17, which roughly
matches the fraction of households in 2005 with incomes exceeding $100,000.
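
Both numbers in this paragraph can be checked with Python’s standard library. The sketch below assumes nothing beyond μ = 10.75 and σ = 0.8; the variable names are ours.

```python
from math import exp, log
from statistics import NormalDist

# Y ~ normal(mu = 10.75, sigma = 0.8), so e^Y is the income on the dollar scale.
y_dist = NormalDist(mu=10.75, sigma=0.8)

# The median of e^Y is e raised to the median of Y, since exp() is increasing.
median_income = exp(y_dist.inv_cdf(0.5))

# Pr(e^Y > 100000) = Pr(Y > log 100000) = 1 - F(log 100000)
tail_prob = 1 - y_dist.cdf(log(100000))

print(round(median_income), round(tail_prob, 2))
```

The second calculation uses the rule Pr(Y > a) = 1 − F(a) from the cdf section above.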
Fig. 2.2. Normal distribution with mean 10.75 and standard deviation 0.8.


2.4.3 Descriptions of distributions 


The mean or expectation of an unknown quantity Y is given by

E[Y] = Σ_{y∈𝒴} y p(y) if Y is discrete;
E[Y] = ∫_{y∈𝒴} y p(y) dy if Y is continuous.

The mean is the center of mass of the distribution. However, it is not in general
equal to either of


the mode: “the most probable value of Y,” or
the median: “the value of Y in the middle of the distribution.”


In particular, for skewed distributions (like income distributions) the mean 
can be far from a “typical” sample value: see, for example, Figure 2.3. Still, 
the mean is a very popular description of the location of a distribution. Some 
justifications for reporting and studying the mean include the following: 


1. The mean of {Y_1, ..., Y_n} is a scaled version of the total, and the total is
often a quantity of interest.

2. Suppose you are forced to guess what the value of Y is, and you are
penalized by an amount (Y − y_guess)². Then guessing E[Y] minimizes your
expected penalty.

3. In some simple models that we shall see shortly, the sample mean contains
all of the information about the population that can be obtained from the
data.
Fig. 2.3. Mode, median and mean of the normal and lognormal distributions, with
parameters μ = 10.75 and σ = 0.8.


In addition to the location of a distribution we are often interested in how
spread out it is. The most popular measure of spread is the variance of a
distribution:

$$\mathrm{Var}[Y] = \mathrm{E}[(Y - \mathrm{E}[Y])^2] = \mathrm{E}[Y^2] - \mathrm{E}[Y]^2 .$$

The variance is the average squared distance that a sample value Y will be
from the population mean E[Y]. The standard deviation is the square root of
the variance, and is on the same scale as Y.

Alternative measures of spread are based on quantiles. For a continuous,
strictly increasing cdf F, the α-quantile is the value y_α such that F(y_α) =
Pr(Y ≤ y_α) = α. The interquartile range of a distribution is the interval
(y_.25, y_.75), which contains 50% of the mass of the distribution. Similarly, the
interval (y_.025, y_.975) contains 95% of the mass of the distribution.
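
A brief sketch of these summaries in Python, reusing two distributions from earlier in the section: the binomial(4, .25) pdf for the mean and variance, and the normal(10.75, .8²) distribution for the quantile intervals. The variable names are ours, chosen for illustration.

```python
from math import comb
from statistics import NormalDist

# Discrete case: mean and variance computed directly from the pdf.
pmf = {y: comb(4, y) * 0.25**y * 0.75**(4 - y) for y in range(5)}
mean = sum(y * p for y, p in pmf.items())
variance = sum((y - mean) ** 2 * p for y, p in pmf.items())

# Continuous case: quantile-based intervals from the inverse cdf.
y_dist = NormalDist(mu=10.75, sigma=0.8)
iqr = (y_dist.inv_cdf(0.25), y_dist.inv_cdf(0.75))
central_95 = (y_dist.inv_cdf(0.025), y_dist.inv_cdf(0.975))

# Probability mass inside each interval, F(upper) - F(lower).
mass_iqr = y_dist.cdf(iqr[1]) - y_dist.cdf(iqr[0])
mass_95 = y_dist.cdf(central_95[1]) - y_dist.cdf(central_95[0])

print(mean, variance)
print(iqr, mass_iqr)
print(central_95, mass_95)
```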
2.5 Joint distributions


Discrete distributions


Let


• 𝒴_1, 𝒴_2 be two countable sample spaces;
• Y_1, Y_2 be two random variables, taking values in 𝒴_1, 𝒴_2 respectively.


Joint beliefs about Y_1 and Y_2 can be represented with probabilities. For
example, for subsets A ⊂ 𝒴_1 and B ⊂ 𝒴_2, Pr({Y_1 ∈ A} ∩ {Y_2 ∈ B}) represents
our belief that Y_1 is in A and that Y_2 is in B. The joint pdf or joint density
of Y_1 and Y_2 is defined as

$$p_{Y_1 Y_2}(y_1, y_2) = \Pr(\{Y_1 = y_1\} \cap \{Y_2 = y_2\}),
  \quad \text{for } y_1 \in \mathcal{Y}_1,\ y_2 \in \mathcal{Y}_2 .$$


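The marginal density of Y_1 can be computed from the joint density:

$$\begin{aligned}
p_{Y_1}(y_1) &= \Pr(Y_1 = y_1) \\
  &= \sum_{y_2 \in \mathcal{Y}_2} \Pr(\{Y_1 = y_1\} \cap \{Y_2 = y_2\}) \\
  &= \sum_{y_2 \in \mathcal{Y}_2} p_{Y_1 Y_2}(y_1, y_2) .
\end{aligned}$$

The conditional density of Y_2 given {Y_1 = y_1} can be computed from the joint
density and the marginal density:

$$\begin{aligned}
p_{Y_2 \mid Y_1}(y_2 \mid y_1)
  &= \frac{\Pr(\{Y_1 = y_1\} \cap \{Y_2 = y_2\})}{\Pr(Y_1 = y_1)} \\
  &= \frac{p_{Y_1 Y_2}(y_1, y_2)}{p_{Y_1}(y_1)} .
\end{aligned}$$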


You should convince yourself that


{p_{Y1}, p_{Y2|Y1}} can be derived from p_{Y1Y2},
{p_{Y2}, p_{Y1|Y2}} can be derived from p_{Y1Y2},
p_{Y1Y2} can be derived from {p_{Y1}, p_{Y2|Y1}},
p_{Y1Y2} can be derived from {p_{Y2}, p_{Y1|Y2}},


but
p_{Y1Y2} cannot be derived from {p_{Y1}, p_{Y2}}.


The subscripts of density functions are often dropped, in which case the type
of density function is determined from the function argument: p(y_1) refers to
p_{Y1}(y_1), p(y_1, y_2) refers to p_{Y1Y2}(y_1, y_2), p(y_1|y_2) refers to p_{Y1|Y2}(y_1|y_2), etc.
Example: Social mobility


Logan (1983) reports the following joint distribution of occupational categories
of fathers and sons:


                                      son’s occupation
father’s occupation    farm    operatives    craftsmen    sales    professional
farm                   0.018   0.035         0.031        0.008    0.018
operatives             0.002   0.112         0.064        0.032    0.069
craftsmen              0.001   0.066         0.094        0.032    0.084
sales                  0.001   0.018         0.019        0.010    0.051
professional           0.001   0.029         0.032        0.043    0.130


Suppose we are to sample a father-son pair from this population. Let Y_1 be
the father’s occupation and Y_2 the son’s occupation. Then

$$\begin{aligned}
\Pr(Y_2 = \text{professional} \mid Y_1 = \text{farm})
  &= \frac{\Pr(Y_2 = \text{professional} \cap Y_1 = \text{farm})}{\Pr(Y_1 = \text{farm})} \\
  &= \frac{.018}{.018 + .035 + .031 + .008 + .018} \\
  &= .164 .
\end{aligned}$$
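
The marginal and conditional calculations for this table can be sketched as follows. The nested-list layout and the helper name `son_given_father` are our own choices for illustration; the numbers are those in the table above.

```python
occupations = ["farm", "operatives", "craftsmen", "sales", "professional"]

# joint[i][j] = Pr(father's occupation = i, son's occupation = j)
joint = [
    [0.018, 0.035, 0.031, 0.008, 0.018],
    [0.002, 0.112, 0.064, 0.032, 0.069],
    [0.001, 0.066, 0.094, 0.032, 0.084],
    [0.001, 0.018, 0.019, 0.010, 0.051],
    [0.001, 0.029, 0.032, 0.043, 0.130],
]

# Marginal density of the father's occupation: sum the joint over sons.
father_marginal = [sum(row) for row in joint]

def son_given_father(son, father):
    """Conditional probability Pr(Y_2 = son | Y_1 = father)."""
    i, j = occupations.index(father), occupations.index(son)
    return joint[i][j] / father_marginal[i]

p = son_given_father("professional", "farm")
print(round(p, 3))
```

For a fixed father’s occupation, the conditional probabilities over the son’s occupation sum to 1, which is what makes the conditional density a proper distribution.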


Continuous joint distributions 


If Y_1 and Y_2 are continuous we start with a cumulative distribution function.
Given a continuous joint cdf F_{Y1Y2}(a, b) = Pr({Y_1 ≤ a} ∩ {Y_2 ≤ b}), there is
a function p_{Y1Y2} such that

$$F_{Y_1 Y_2}(a, b) = \int_{-\infty}^{a} \int_{-\infty}^{b} p_{Y_1 Y_2}(y_1, y_2)\, dy_2\, dy_1 .$$

The function p_{Y1Y2} is the joint density of Y_1 and Y_2. As in the discrete case,
we have


• p_{Y1}(y_1) = ∫_{−∞}^{∞} p_{Y1Y2}(y_1, y_2) dy_2;
• p_{Y2|Y1}(y_2|y_1) = p_{Y1Y2}(y_1, y_2) / p_{Y1}(y_1).


You should convince yourself that p_{Y2|Y1}(y_2|y_1) is an actual probability
density, i.e. for each value of y_1 it is a probability density for Y_2.

Mixed continuous and discrete variables 

Let Y_1 be discrete and Y_2 be continuous. For example, Y_1 could be
occupational category and Y_2 could be personal income. Suppose we define


• a marginal density p_{Y1} from our beliefs Pr(Y_1 = y_1);
• a conditional density p_{Y2|Y1}(y_2|y_1) from Pr(Y_2 ≤ y_2|Y_1 = y_1) = F_{Y2|Y1}(y_2|y_1)
  as above.

The joint density of Y_1 and Y_2 is then

$$p_{Y_1 Y_2}(y_1, y_2) = p_{Y_1}(y_1) \times p_{Y_2 \mid Y_1}(y_2 \mid y_1),$$

and has the property that

$$\Pr(Y_1 \in A, Y_2 \in B) = \int_{y_2 \in B} \Big\{ \sum_{y_1 \in A} p_{Y_1 Y_2}(y_1, y_2) \Big\}\, dy_2 .$$


Bayes’ rule and parameter estimation 


Let 


θ = proportion of people in a large population who have a certain
characteristic.

Y = number of people in a small random sample from the population who
have the characteristic.


Then we might treat θ as continuous and Y as discrete. Bayesian estimation
of θ derives from the calculation of p(θ|y), where y is the observed value of Y.
This calculation first requires that we have a joint density p(y, θ) representing
our beliefs about θ and the survey outcome Y. Often it is natural to construct
this joint density from


• p(θ), beliefs about θ;
• p(y|θ), beliefs about Y for each value of θ.


Having observed {Y = y}, we need to compute our updated beliefs about θ:

$$p(\theta \mid y) = p(\theta, y)/p(y) = p(\theta)\,p(y \mid \theta)/p(y) .$$

This conditional density is called the posterior density of θ. Suppose θ_a and
θ_b are two possible numerical values of the true value of θ. The posterior
probability (density) of θ_a relative to θ_b, conditional on Y = y, is

$$\begin{aligned}
\frac{p(\theta_a \mid y)}{p(\theta_b \mid y)}
  &= \frac{p(\theta_a)\,p(y \mid \theta_a)/p(y)}{p(\theta_b)\,p(y \mid \theta_b)/p(y)} \\
  &= \frac{p(\theta_a)\,p(y \mid \theta_a)}{p(\theta_b)\,p(y \mid \theta_b)} .
\end{aligned}$$

This means that to evaluate the relative posterior probabilities of θ_a and θ_b,
we do not need to compute p(y). Another way to think about it is that, as a
function of θ,

$$p(\theta \mid y) \propto p(\theta)\,p(y \mid \theta) .$$

The constant of proportionality is 1/p(y), which could be computed from

$$p(y) = \int_{\Theta} p(y, \theta)\, d\theta = \int_{\Theta} p(y \mid \theta)\,p(\theta)\, d\theta ,$$

giving

$$p(\theta \mid y) = \frac{p(\theta)\,p(y \mid \theta)}{\int_{\Theta} p(\theta)\,p(y \mid \theta)\, d\theta} .$$

As we will see in later chapters, the numerator is the critical part.
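
To make the role of p(y) concrete, here is a minimal numerical sketch. It assumes, purely for illustration, a uniform prior on θ and a binomial sampling model with y = 54 successes out of n = 100 (the survey outcome E from the earlier candidate example); the function names and the midpoint-rule grid are our own choices, not a method prescribed by the text.

```python
from math import comb

def likelihood(theta, y=54, n=100):
    """Binomial sampling model p(y | theta)."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

def prior(theta):
    """Uniform prior density on [0, 1] (an assumption for this sketch)."""
    return 1.0

def unnormalized_posterior(theta):
    return prior(theta) * likelihood(theta)

# Relative posterior density of two theta values: p(y) is never needed.
theta_a, theta_b = 0.6, 0.5
ratio = unnormalized_posterior(theta_a) / unnormalized_posterior(theta_b)

# p(y) approximated by a midpoint rule over a grid on [0, 1].
m = 10000
grid = [(i + 0.5) / m for i in range(m)]
p_y = sum(unnormalized_posterior(t) for t in grid) / m

def posterior(theta):
    """Normalized posterior density p(theta | y)."""
    return unnormalized_posterior(theta) / p_y

print(ratio, posterior(theta_a) / posterior(theta_b), p_y)
```

The two printed ratios agree because dividing both densities by the same constant p(y) leaves their ratio unchanged.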


2.6 Independent random variables 


Suppose Y_1, ..., Y_n are random variables and that θ is a parameter describing
the conditions under which the random variables are generated. We say that
Y_1, ..., Y_n are conditionally independent given θ if for every collection of n
sets {A_1, ..., A_n} we have

$$\Pr(Y_1 \in A_1, \ldots, Y_n \in A_n \mid \theta)
  = \Pr(Y_1 \in A_1 \mid \theta) \times \cdots \times \Pr(Y_n \in A_n \mid \theta) .$$

Notice that this definition of independent random variables is based on our
previous definition of independent events, where here each {Y_j ∈ A_j} is an
event. From our previous calculations, if independence holds, then

$$\Pr(Y_i \in A_i \mid \theta, Y_j \in A_j) = \Pr(Y_i \in A_i \mid \theta),$$

so conditional independence can be interpreted as meaning that Y_j gives no
additional information about Y_i beyond that in knowing θ. Furthermore, under
independence the joint density is given by

$$p(y_1, \ldots, y_n \mid \theta) = p_{Y_1}(y_1 \mid \theta) \times \cdots \times p_{Y_n}(y_n \mid \theta)
  = \prod_{i=1}^{n} p_{Y_i}(y_i \mid \theta),$$

the product of the marginal densities.

Suppose Y_1, ..., Y_n are generated in similar ways from a common process.
For example, they could all be samples from the same population, or runs
of an experiment performed under similar conditions. This suggests that the
marginal densities are all equal to some common density giving

$$p(y_1, \ldots, y_n \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta) .$$

In this case, we say that Y_1, ..., Y_n are conditionally independent and
identically distributed (i.i.d.). Mathematical shorthand for this is

$$Y_1, \ldots, Y_n \mid \theta \sim \text{i.i.d. } p(y \mid \theta) .$$
2.7 Exchangeability


Example: Happiness


Participants in the 1998 General Social Survey were asked whether or not
they were generally happy. Let Y_i be the random variable associated with this
question, so that

$$Y_i = \begin{cases}
  1 & \text{if participant } i \text{ says that they are generally happy,} \\
  0 & \text{otherwise.}
\end{cases}$$

In this section we will consider the structure of our joint beliefs about
Y_1, ..., Y_10, the outcomes of the first 10 randomly selected survey
participants. As before, let p(y_1, ..., y_10) be our shorthand notation for
Pr(Y_1 = y_1, ..., Y_10 = y_10), where each y_i is either 0 or 1.


Exchangeability 
Suppose we are asked to assign probabilities to three different outcomes: 


p(1,0,0,1,0,1,1,0,1,1) =? 
p(1,0,1,0,1,1,0,1,1,0) =? 
p(1,1,0,0,1,1,0,0,1,1) =? 


Is there an argument for assigning them the same numerical value? Notice 
that each sequence contains six ones and four zeros. 


Definition 3 (Exchangeable) Let p(y_1, ..., y_n) be the joint density of Y_1,
..., Y_n. If p(y_1, ..., y_n) = p(y_{π_1}, ..., y_{π_n}) for all permutations π of {1, ..., n},
then Y_1, ..., Y_n are exchangeable.


Roughly speaking, Y_1, ..., Y_n are exchangeable if the subscript labels convey
no information about the outcomes.


Independence versus dependence 
Consider the following two probability assignments: 


Pr(Y_10 = 1) = a
Pr(Y_10 = 1|Y_1 = Y_2 = ··· = Y_8 = Y_9 = 1) = b


Should we have a < b, a = b, or a > b? If a ≠ b then Y_10 is NOT independent
of Y_1, ..., Y_9.
Conditional independence


Suppose someone told you the numerical value of θ, the rate of happiness
among the 1,272 respondents to the question. Do the following probability
assignments seem reasonable?

$$\begin{aligned}
\Pr(Y_{10} = 1 \mid \theta) &\overset{?}{\approx} \theta \\
\Pr(Y_{10} = 1 \mid Y_1 = y_1, \ldots, Y_9 = y_9, \theta) &\overset{?}{\approx} \theta \\
\Pr(Y_9 = 1 \mid Y_1 = y_1, \ldots, Y_8 = y_8, Y_{10} = y_{10}, \theta) &\overset{?}{\approx} \theta
\end{aligned}$$

If these assignments are reasonable, then we can consider the Y_i’s as
conditionally independent and identically distributed given θ, or at least approximately
so: The population size of 1,272 is much larger than the sample size of 10, in
which case sampling without replacement is approximately the same as i.i.d.
sampling with replacement. Assuming conditional independence,

$$\begin{aligned}
\Pr(Y_i = y_i \mid \theta, Y_j = y_j, j \ne i) &= \theta^{y_i}(1-\theta)^{1-y_i} \\
\Pr(Y_1 = y_1, \ldots, Y_{10} = y_{10} \mid \theta)
  &= \prod_{i=1}^{10} \theta^{y_i}(1-\theta)^{1-y_i} \\
  &= \theta^{\sum y_i}(1-\theta)^{10 - \sum y_i} .
\end{aligned}$$

If θ is uncertain to us, we describe our beliefs about it with p(θ), a prior
distribution. The marginal joint distribution of Y_1, ..., Y_10 is then

$$p(y_1, \ldots, y_{10}) = \int_0^1 p(y_1, \ldots, y_{10} \mid \theta)\,p(\theta)\, d\theta
  = \int_0^1 \theta^{\sum y_i}(1-\theta)^{10 - \sum y_i}\,p(\theta)\, d\theta .$$

Now consider our probabilities for the three binary sequences given above:

$$\begin{aligned}
p(1,0,0,1,0,1,1,0,1,1) &= \int \theta^{6}(1-\theta)^{4}\,p(\theta)\, d\theta \\
p(1,0,1,0,1,1,0,1,1,0) &= \int \theta^{6}(1-\theta)^{4}\,p(\theta)\, d\theta \\
p(1,1,0,0,1,1,0,0,1,1) &= \int \theta^{6}(1-\theta)^{4}\,p(\theta)\, d\theta
\end{aligned}$$

It looks like Y_1, ..., Y_n are exchangeable under this model of beliefs.
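
The equality of these three marginal probabilities can also be checked numerically. The sketch below assumes a hypothetical prior density p(θ) = 2θ on [0, 1], chosen only so that the prior is not uniform, and approximates each integral with a midpoint rule; the function names are ours.

```python
def seq_prob_given_theta(seq, theta):
    """Pr(Y_1 = y_1, ..., Y_n = y_n | theta) for conditionally i.i.d. binary Y_i."""
    p = 1.0
    for y in seq:
        p *= theta if y == 1 else 1 - theta
    return p

def marginal_prob(seq, prior, m=5000):
    """Midpoint-rule approximation to the integral of p(seq | theta) p(theta)."""
    grid = ((i + 0.5) / m for i in range(m))
    return sum(seq_prob_given_theta(seq, t) * prior(t) for t in grid) / m

def prior(theta):
    """Hypothetical prior density 2 * theta on [0, 1] (an assumption)."""
    return 2.0 * theta

sequences = [
    (1, 0, 0, 1, 0, 1, 1, 0, 1, 1),
    (1, 0, 1, 0, 1, 1, 0, 1, 1, 0),
    (1, 1, 0, 0, 1, 1, 0, 0, 1, 1),
]
marginals = [marginal_prob(s, prior) for s in sequences]
print(marginals)
```

Each sequence has six ones and four zeros, so each integrand is the same function of θ and the three results agree up to floating-point rounding, whatever prior is used.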


Claim:

If θ ∼ p(θ) and Y_1, ..., Y_n are conditionally i.i.d. given θ, then marginally
(unconditionally on θ), Y_1, ..., Y_n are exchangeable.

Proof:

Suppose Y_1, ..., Y_n are conditionally i.i.d. given some unknown parameter θ.
Then for any permutation π of {1, ..., n} and any set of values (y_1, ..., y_n) ∈ 𝒴^n,

$$\begin{aligned}
p(y_1, \ldots, y_n)
  &= \int p(y_1, \ldots, y_n \mid \theta)\,p(\theta)\, d\theta
     && \text{(definition of marginal probability)} \\
  &= \int \Big\{ \prod_{i=1}^{n} p(y_i \mid \theta) \Big\}\,p(\theta)\, d\theta
     && \text{($Y_i$’s are conditionally i.i.d.)} \\
  &= \int \Big\{ \prod_{i=1}^{n} p(y_{\pi_i} \mid \theta) \Big\}\,p(\theta)\, d\theta
     && \text{(product does not depend on order)} \\
  &= p(y_{\pi_1}, \ldots, y_{\pi_n})
     && \text{(definition of marginal probability).}
\end{aligned}$$


2.8 de Finetti’s theorem 


We have seen that

$$\left.
\begin{array}{l}
Y_1, \ldots, Y_n \mid \theta \ \text{i.i.d.} \\
\theta \sim p(\theta)
\end{array}
\right\} \Rightarrow Y_1, \ldots, Y_n \ \text{are exchangeable.}$$

What about an arrow in the other direction? Let {Y_1, Y_2, ...} be a potentially
infinite sequence of random variables all having a common sample space 𝒴.


Theorem 1 (de Finetti) Let Y_i ∈ 𝒴 for all i ∈ {1, 2, ...}. Suppose that, for
any n, our belief model for Y_1, ..., Y_n is exchangeable:

$$p(y_1, \ldots, y_n) = p(y_{\pi_1}, \ldots, y_{\pi_n})$$

for all permutations π of {1, ..., n}. Then our model can be written as

$$p(y_1, \ldots, y_n) = \int \Big\{ \prod_{i=1}^{n} p(y_i \mid \theta) \Big\}\,p(\theta)\, d\theta$$

for some parameter θ, some prior distribution on θ and some sampling model
p(y|θ). The prior and sampling model depend on the form of the belief model
p(y_1, ..., y_n).

The probability distribution p(θ) represents our beliefs about the outcomes of
{Y_1, Y_2, ...}, induced by our belief model p(y_1, y_2, ...). More precisely,


p(θ) represents our beliefs about lim_{n→∞} Σ Y_i/n in the binary case;
p(θ) represents our beliefs about lim_{n→∞} Σ (Y_i ≤ c)/n for each c in the
general case.


The main ideas of this and the previous section can be summarized as follows:

$$\left.
\begin{array}{l}
Y_1, \ldots, Y_n \mid \theta \ \text{are i.i.d.} \\
\theta \sim p(\theta)
\end{array}
\right\} \Leftrightarrow Y_1, \ldots, Y_n \ \text{are exchangeable for all } n .$$

When is the condition “Y_1, ..., Y_n are exchangeable for all n” reasonable?
For this condition to hold, we must have exchangeability and repeatability.
Exchangeability will hold if the labels convey no information. Situations in
which repeatability is reasonable include the following:

Y_1, ..., Y_n are outcomes of a repeatable experiment;

Y_1, ..., Y_n are sampled from a finite population with replacement;

Y_1, ..., Y_n are sampled from an infinite population without replacement.

If Y_1, ..., Y_n are exchangeable and sampled from a finite population of size
N >> n without replacement, then they can be modeled as approximately
being conditionally i.i.d. (Diaconis and Freedman, 1980).


2.9 Discussion and further references 


The notion of subjective probability in terms of a coherent gambling strategy 
was developed by de Finetti, who is of course also responsible for de Finetti’s 
theorem (de Finetti, 1931, 1937). Both of these topics were studied further by 
many others, including Savage (Savage, 1954; Hewitt and Savage, 1955). 

The concept of exchangeability goes beyond just the concept of an in- 
finitely exchangeable sequence considered in de Finetti’s theorem. Diaconis 
and Freedman (1980) consider exchangeability for finite populations or se- 
quences, and Diaconis (1988) surveys some other versions of exchangeability. 
Chapter 4 of Bernardo and Smith (1994) provides a guide to building statis- 
tical models based on various types of exchangeability. A very comprehensive 
and mathematical review of exchangeability is given in Aldous (1985), which 
in particular provides an excellent survey of exchangeability as applied to 
random matrices. 
One-parameter models


A one-parameter model is a class of sampling distributions that is indexed 
by a single unknown parameter. In this chapter we discuss Bayesian inference 
for two one-parameter models: the binomial model and the Poisson model. In 
addition to being useful statistical tools, these models also provide a simple 
environment within which we can learn the basics of Bayesian data analysis, 
including conjugate prior distributions, predictive distributions and confidence 
regions. 


3.1 The binomial model 
Happiness data 


Each female of age 65 or over in the 1998 General Social Survey was asked
whether or not they were generally happy. Let Y_i = 1 if respondent i reported
being generally happy, and let Y_i = 0 otherwise. If we lack information
distinguishing these n = 129 individuals we may treat their responses as being
exchangeable. Since 129 is much smaller than the total size N of the female
senior citizen population, the results of the last chapter indicate that our joint
beliefs about Y_1, ..., Y_129 are well approximated by


• our beliefs about θ = Σ_{i=1}^{N} Y_i/N;

• the model that, conditional on θ, the Y_i’s are i.i.d. binary random variables
  with expectation θ.

The last item says that the probability for any potential outcome {y_1, ..., y_129},
conditional on θ, is given by

$$p(y_1, \ldots, y_{129} \mid \theta) = \theta^{\sum_{i=1}^{129} y_i}(1-\theta)^{129 - \sum_{i=1}^{129} y_i} .$$


What remains to be specified is our prior distribution. 


P.D. Hoff, A First Course in Bayesian Statistical Methods, 
Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_3, 
© Springer Science+Business Media, LLC 2009 
A uniform prior distribution


The parameter θ is some unknown number between 0 and 1. Suppose our
prior information is such that all subintervals of [0, 1] having the same length
also have the same probability. Symbolically,

Pr(a ≤ θ ≤ b) = Pr(a + c ≤ θ ≤ b + c) for 0 ≤ a < b < b + c ≤ 1.

This condition implies that our density for θ must be the uniform density:

p(θ) = 1 for all θ ∈ [0, 1].

For this prior distribution and the above sampling model, Bayes’ rule gives

$$\begin{aligned}
p(\theta \mid y_1, \ldots, y_{129})
  &= \frac{p(y_1, \ldots, y_{129} \mid \theta)\,p(\theta)}{p(y_1, \ldots, y_{129})} \\
  &= p(y_1, \ldots, y_{129} \mid \theta) \times \frac{1}{p(y_1, \ldots, y_{129})} \\
  &\propto p(y_1, \ldots, y_{129} \mid \theta) .
\end{aligned}$$

The last line says that in this particular case p(θ|y_1, ..., y_129) and
p(y_1, ..., y_129|θ) are proportional to each other as functions of θ. This is because the
posterior distribution is equal to p(y_1, ..., y_129|θ) divided by something that
does not depend on θ. This means that these two functions of θ have the same
shape, but not necessarily the same scale.


Data and posterior distribution 


• 129 individuals surveyed;
• 118 individuals report being generally happy (91%);
• 11 individuals do not report being generally happy (9%).


The probability of these data for a given value of θ is

$$p(y_1, \ldots, y_{129} \mid \theta) = \theta^{118}(1-\theta)^{11} .$$

A plot of this probability as a function of θ is shown in the first plot of Figure
3.1. Our result above about proportionality says that the posterior distribution
p(θ|y_1, ..., y_129) will have the same shape as this function, and so we know
that the true value of θ is very likely to be near 0.91, and almost certainly
above 0.80. However, we will often want to be more precise than this, and
we will need to know the scale of p(θ|y_1, ..., y_n) as well as the shape. From
Bayes’ rule, we have

$$\begin{aligned}
p(\theta \mid y_1, \ldots, y_{129})
  &= \theta^{118}(1-\theta)^{11} \times p(\theta)/p(y_1, \ldots, y_{129}) \\
  &= \theta^{118}(1-\theta)^{11} \times 1/p(y_1, \ldots, y_{129}) .
\end{aligned}$$
Fig. 3.1. Sampling probability of the data as a function of θ, along with the
posterior distribution. Note that a uniform prior distribution (plotted in gray in
the second panel) gives a posterior distribution that is proportional to the sampling
probability.


It turns out that we can calculate the scale or “normalizing constant”
1/p(y_1, ..., y_129) using the following result from calculus:

$$\int_0^1 \theta^{a-1}(1-\theta)^{b-1}\, d\theta = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$

(the value of the gamma function Γ(x) for any number x > 0 can be looked
up in a table, or with R using the gamma() function). How does the calculus
result help us compute p(θ|y_1, ..., y_129)? Let’s recall what we know about
p(θ|y_1, ..., y_129):

(a) ∫_0^1 p(θ|y_1, ..., y_129) dθ = 1, since all probability distributions integrate or
sum to 1;

(b) p(θ|y_1, ..., y_129) = θ^118 (1 − θ)^11 / p(y_1, ..., y_129), from Bayes’ rule.


Therefore,
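$$\begin{aligned}
1 &= \int_0^1 p(\theta \mid y_1, \ldots, y_{129})\, d\theta && \text{using (a)} \\
1 &= \int_0^1 \theta^{118}(1-\theta)^{11} / p(y_1, \ldots, y_{129})\, d\theta && \text{using (b)} \\
1 &= \frac{1}{p(y_1, \ldots, y_{129})} \int_0^1 \theta^{118}(1-\theta)^{11}\, d\theta \\
1 &= \frac{1}{p(y_1, \ldots, y_{129})} \frac{\Gamma(119)\Gamma(12)}{\Gamma(131)}
  && \text{using the calculus result, and so}
\end{aligned}$$

$$p(y_1, \ldots, y_{129}) = \frac{\Gamma(119)\Gamma(12)}{\Gamma(131)} .$$

You should convince yourself that this result holds for any sequence {y_1, ..., y_129}
that contains 118 ones and 11 zeros. Putting everything together, we have

$$\begin{aligned}
p(\theta \mid y_1, \ldots, y_{129})
  &= \frac{\Gamma(131)}{\Gamma(119)\Gamma(12)}\,\theta^{118}(1-\theta)^{11},
     \quad \text{which we will write as} \\
  &= \frac{\Gamma(131)}{\Gamma(119)\Gamma(12)}\,\theta^{119-1}(1-\theta)^{12-1} .
\end{aligned}$$

This density for θ is called a beta distribution with parameters a = 119 and
b = 12, which can be calculated, plotted and sampled from in R using the
functions dbeta() and rbeta().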
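
The normalizing constant can be checked numerically. The sketch below is a plain Python stand-in for R’s dbeta(), written with log-gamma values to avoid very large intermediate numbers; the helper names and the midpoint-rule check are our own and are meant only as an illustration.

```python
from math import exp, lgamma, log

def log_beta_fn(a, b):
    """log of Gamma(a) Gamma(b) / Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def dbeta(theta, a, b):
    """beta(a, b) density at theta, for 0 < theta < 1."""
    return exp((a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta_fn(a, b))

# log p(y_1, ..., y_129) = log of Gamma(119) Gamma(12) / Gamma(131)
log_p_y = log_beta_fn(119, 12)

# Midpoint-rule checks: the unnormalized integral matches the gamma formula,
# and the normalized beta(119, 12) density integrates to 1.
m = 20000
grid = [(i + 0.5) / m for i in range(m)]
raw_integral = sum(t**118 * (1 - t)**11 for t in grid) / m
area = sum(dbeta(t, 119, 12) for t in grid) / m

print(raw_integral, exp(log_p_y), area)
```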


The beta distribution 


An uncertain quantity θ, known to be between 0 and 1, has a beta(a, b)
distribution if

$$p(\theta) = \texttt{dbeta}(\theta, a, b)
  = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}
  \quad \text{for } 0 \le \theta \le 1 .$$

For such a random variable,

mode[θ] = (a − 1)/[(a − 1) + (b − 1)] if a > 1 and b > 1;

E[θ] = a/(a + b);

Var[θ] = ab/[(a + b + 1)(a + b)²] = E[θ] × E[1 − θ]/(a + b + 1).

For our data on happiness in which we observed (Y_1, ..., Y_129) = (y_1, ..., y_129)
with Σ_{i=1}^{129} y_i = 118,

mode[θ|y_1, ..., y_129] = 0.915;

E[θ|y_1, ..., y_129] = 0.908;

sd[θ|y_1, ..., y_129] = 0.025.
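
These three posterior summaries follow from the formulas above with a = 119 and b = 12. A short sketch, in which the function name `beta_summaries` is our own:

```python
from math import sqrt

def beta_summaries(a, b):
    """Mode, mean and standard deviation of a beta(a, b) distribution.

    The mode formula assumes a > 1 and b > 1.
    """
    mode = (a - 1) / ((a - 1) + (b - 1))
    mean = a / (a + b)
    variance = a * b / ((a + b + 1) * (a + b) ** 2)
    return mode, mean, sqrt(variance)

mode, mean, sd = beta_summaries(119, 12)
print(round(mode, 3), round(mean, 3), round(sd, 3))
```

The posterior mode equals the sample proportion 118/129, while the posterior mean 119/131 is pulled slightly toward 1/2, the mean of the uniform prior.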
3.1.1 Inference for exchangeable binary data
Posterior inference under a uniform prior
If Y_1, ..., Y_n|θ are i.i.d. binary(θ), we showed that

$$p(\theta \mid y_1, \ldots, y_n) = \theta^{\sum y_i}(1-\theta)^{n - \sum y_i} \times p(\theta)/p(y_1, \ldots, y_n) .$$

If we compare the relative probabilities of any two θ-values, say θ_a and θ_b, we
see that

$$\begin{aligned}
\frac{p(\theta_a \mid y_1, \ldots, y_n)}{p(\theta_b \mid y_1, \ldots, y_n)}
  &= \frac{\theta_a^{\sum y_i}(1-\theta_a)^{n - \sum y_i} \times p(\theta_a)/p(y_1, \ldots, y_n)}
          {\theta_b^{\sum y_i}(1-\theta_b)^{n - \sum y_i} \times p(\theta_b)/p(y_1, \ldots, y_n)} \\
  &= \left(\frac{\theta_a}{\theta_b}\right)^{\sum y_i}
     \left(\frac{1-\theta_a}{1-\theta_b}\right)^{n - \sum y_i}
     \frac{p(\theta_a)}{p(\theta_b)} .
\end{aligned}$$

This shows that the probability density at θ_a relative to that at θ_b depends
on y_1, ..., y_n only through Σ_{i=1}^{n} y_i. From this, you can show that

$$\Pr(\theta \in A \mid Y_1 = y_1, \ldots, Y_n = y_n)
  = \Pr\Big(\theta \in A \,\Big|\, \sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} y_i\Big) .$$

We interpret this as meaning that Σ_{i=1}^{n} Y_i contains all the information about
θ available from the data, and we say that Σ_{i=1}^{n} Y_i is a sufficient statistic for
θ and p(y_1, ..., y_n|θ). The word “sufficient” is used because it is “sufficient” to
know Σ Y_i in order to make inference about θ. In this case where Y_1, ..., Y_n|θ
are i.i.d. binary(θ) random variables, the sufficient statistic Y = Σ_{i=1}^{n} Y_i has
a binomial distribution with parameters (n, θ).


The binomial distribution 


A random variable $Y \in \{0,1,\ldots,n\}$ has a binomial$(n, \theta)$
distribution if

$$\Pr(Y = y|\theta) = \text{dbinom}(y, n, \theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}, \quad y \in \{0,1,\ldots,n\}.$$

Binomial distributions with different values of $n$ and $\theta$ are plotted in
Figures 3.2 and 3.3. For a binomial$(n, \theta)$ random variable,

$$\text{E}[Y|\theta] = n\theta;$$
$$\text{Var}[Y|\theta] = n\theta(1-\theta).$$
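
As a quick numerical check of these formulas, the sketch below compares dbinom() with the explicit expression and computes the mean and variance by direct summation. The values $n = 10$ and $\theta = 0.2$ are chosen only for illustration.

```r
n <- 10; theta <- 0.2
y <- 0:n

p         <- dbinom(y, n, theta)                            # built-in pmf
p.formula <- choose(n, y) * theta^y * (1 - theta)^(n - y)   # explicit formula

mean.y <- sum(y * p)                  # should equal n * theta
var.y  <- sum((y - mean.y)^2 * p)     # should equal n * theta * (1 - theta)

c(mean = mean.y, var = var.y)
```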
Fig. 3.2. Binomial distributions with $n = 10$ and $\theta \in \{0.2, 0.8\}$.

Fig. 3.3. Binomial distributions with $n = 100$ and $\theta \in \{0.2, 0.8\}$.


Posterior inference under a uniform prior distribution 
Having observed $Y = y$, our task is to obtain the posterior distribution of
$\theta$:

$$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}
 = \frac{\binom{n}{y}\theta^y(1-\theta)^{n-y}p(\theta)}{p(y)}
 = c(y)\,\theta^y(1-\theta)^{n-y}p(\theta),$$

where $c(y)$ is a function of $y$ and not of $\theta$. For the uniform
distribution with $p(\theta) = 1$, we can find out what $c(y)$ is using our
calculus trick:

$$1 = \int_0^1 c(y)\,\theta^y(1-\theta)^{n-y}\,d\theta$$
$$1 = c(y)\int_0^1 \theta^y(1-\theta)^{n-y}\,d\theta$$
$$1 = c(y)\,\frac{\Gamma(y+1)\Gamma(n-y+1)}{\Gamma(n+2)}.$$

The normalizing constant $c(y)$ is therefore equal to
$\Gamma(n+2)/\{\Gamma(y+1)\Gamma(n-y+1)\}$, and we have

$$p(\theta|y) = \frac{\Gamma(n+2)}{\Gamma(y+1)\Gamma(n-y+1)}\,\theta^y(1-\theta)^{n-y}
 = \frac{\Gamma(n+2)}{\Gamma(y+1)\Gamma(n-y+1)}\,\theta^{(y+1)-1}(1-\theta)^{(n-y+1)-1}
 = \text{beta}(y+1,\, n-y+1).$$

Recall the happiness example, where we observed that $Y \equiv \sum Y_i = 118$:

$$n = 129,\ Y \equiv \sum Y_i = 118 \ \Rightarrow\ \theta|\{Y = 118\} \sim \text{beta}(119, 12).$$
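
The calculus trick can be checked numerically. The sketch below uses the smaller data set $n = 10$, $y = 2$ (our choice, so that the integral is not vanishingly small) and compares the closed-form constant with one obtained from integrate(), then compares the normalized kernel with dbeta() on a grid.

```r
n <- 10; y <- 2

# Unnormalized posterior under the uniform prior
kernel <- function(theta) theta^y * (1 - theta)^(n - y)

c.closed  <- gamma(n + 2) / (gamma(y + 1) * gamma(n - y + 1))  # from the derivation
c.numeric <- 1 / integrate(kernel, 0, 1)$value                 # numerical integration

# The normalized kernel should coincide with the beta(y + 1, n - y + 1) density
theta.grid <- seq(0.01, 0.99, by = 0.01)
max.diff <- max(abs(c.closed * kernel(theta.grid) -
                    dbeta(theta.grid, y + 1, n - y + 1)))

c(closed = c.closed, numeric = c.numeric, max.diff = max.diff)
```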


This confirms the sufficiency result for this model and prior distribution, by
showing that if $\sum y_i = y = 118$,

$$p(\theta|y_1,\ldots,y_n) = p(\theta|y) = \text{beta}(119, 12).$$

In other words, the information contained in $\{Y_1 = y_1,\ldots,Y_n = y_n\}$
is the same as the information contained in $\{Y = y\}$, where $Y = \sum Y_i$
and $y = \sum y_i$.


Posterior distributions under beta prior distributions 


The uniform prior distribution has $p(\theta) = 1$ for all $\theta \in [0,1]$.
This distribution can be thought of as a beta prior distribution with
parameters $a = 1$, $b = 1$:

$$p(\theta) = \frac{\Gamma(2)}{\Gamma(1)\Gamma(1)}\,\theta^{1-1}(1-\theta)^{1-1} = \frac{1}{1 \times 1} \times 1 \times 1 = 1.$$

Note that $\Gamma(x+1) = x! = x \times (x-1) \cdots \times 1$ if $x$ is a
positive integer, and $\Gamma(1) = 1$ by convention. In the previous paragraph,
we saw that

if $\theta \sim$ beta(1, 1) (uniform) and $Y \sim$ binomial$(n, \theta)$,
then $\{\theta|Y = y\} \sim$ beta$(1+y,\, 1+n-y)$,

and so to get the posterior distribution when our prior distribution is
beta($a = 1$, $b = 1$), we can simply add the number of 1’s to the $a$
parameter and the number of 0’s to the $b$ parameter. Does this result hold for
arbitrary beta priors? Let’s find out: Suppose $\theta \sim$ beta$(a, b)$ and
$Y|\theta \sim$ binomial$(n, \theta)$. Having observed $Y = y$,

$$p(\theta|y) = \frac{p(\theta)p(y|\theta)}{p(y)}$$
$$= \frac{1}{p(y)} \times \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1} \times \binom{n}{y}\theta^y(1-\theta)^{n-y}$$
$$= c(n, y, a, b) \times \theta^{a+y-1}(1-\theta)^{b+n-y-1}$$
$$= \text{dbeta}(\theta,\, a+y,\, b+n-y).$$


It is important to understand the last two lines above: The second to last line
says that $p(\theta|y)$ is, as a function of $\theta$, proportional to
$\theta^{a+y-1} \times (1-\theta)^{b+n-y-1}$. This means that it has the same
shape as the beta density dbeta$(\theta, a+y, b+n-y)$. But we also know that
$p(\theta|y)$ and the beta density must both integrate to 1, and therefore they
also share the same scale. These two things together mean that $p(\theta|y)$
and the beta density are in fact the same function. Throughout the book we will
use this trick to identify posterior distributions: We will recognize that the
posterior distribution is proportional to a known probability density, and
therefore must equal that density.
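
The proportionality argument can be illustrated numerically. In the sketch below, beta.update() is a hypothetical helper of our own (not a function from R or from the text), and the beta(2, 8) prior is an arbitrary choice. We normalize prior times likelihood on a grid and compare the result with the claimed beta posterior density.

```r
# Hypothetical helper: conjugate update for a beta prior and binomial data
beta.update <- function(a, b, y, n) c(a = a + y, b = b + n - y)

a <- 2; b <- 8      # illustrative prior, chosen arbitrarily
n <- 10; y <- 2     # data
post <- beta.update(a, b, y, n)

# Prior x likelihood evaluated at the midpoints of 1000 equal-width bins
h <- 0.001
theta <- seq(h / 2, 1 - h / 2, by = h)
unnorm <- dbeta(theta, a, b) * dbinom(y, n, theta)
grid.post <- unnorm / sum(unnorm * h)     # rescale so it integrates to 1

# Same shape and same scale imply the same function
max.diff <- max(abs(grid.post - dbeta(theta, post[["a"]], post[["b"]])))
max.diff
```

The maximum difference should be negligible, limited only by the accuracy of the grid approximation to the integral.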


Conjugacy 


We have shown that a beta prior distribution and a binomial sampling model 
lead to a beta posterior distribution. To reflect this, we say that the class of 
beta priors is conjugate for the binomial sampling model. 


Definition 4 (Conjugate) A class $\mathcal{P}$ of prior distributions for
$\theta$ is called conjugate for a sampling model $p(y|\theta)$ if

$$p(\theta) \in \mathcal{P} \ \Rightarrow\ p(\theta|y) \in \mathcal{P}.$$


Conjugate priors make posterior calculations easy, but might not actually 
represent our prior information. However, mixtures of conjugate prior distri- 
butions are very flexible and are computationally tractable (see Exercises 3.4 
and 3.5). 


Combining information 
If $\theta|\{Y = y\} \sim$ beta$(a+y,\, b+n-y)$, then

$$\text{E}[\theta|y] = \frac{a+y}{a+b+n}, \quad
  \text{mode}[\theta|y] = \frac{a+y-1}{a+b+n-2}, \quad
  \text{Var}[\theta|y] = \frac{\text{E}[\theta|y]\,\text{E}[1-\theta|y]}{a+b+n+1}.$$

The posterior expectation $\text{E}[\theta|y]$ is easily recognized as a
combination of prior and data information:

$$\text{E}[\theta|y] = \frac{a+y}{a+b+n}
 = \frac{a+b}{a+b+n}\,\frac{a}{a+b} + \frac{n}{a+b+n}\,\frac{y}{n}
 = \frac{a+b}{a+b+n} \times \text{prior expectation} + \frac{n}{a+b+n} \times \text{data average}.$$

Fig. 3.4. Beta posterior distributions under two different sample sizes and two
different prior distributions. Look across a row to see the effect of the prior
distribution, and down a column to see the effect of the sample size.


For this model and prior distribution, the posterior expectation (also known
as the posterior mean) is a weighted average of the prior expectation and the
sample average, with weights proportional to $a+b$ and $n$ respectively. This
leads to the interpretation of $a$ and $b$ as “prior data”:


$a \approx$ “prior number of 1’s,”
$b \approx$ “prior number of 0’s,”
$a+b \approx$ “prior sample size.”


If our sample size $n$ is larger than our prior sample size $a+b$, then it
seems reasonable that a majority of our information about $\theta$ should be
coming from the data as opposed to the prior distribution. This is indeed the
case: For example, if $n \gg a+b$, then

$$\frac{a+b}{a+b+n} \approx 0, \quad
  \text{E}[\theta|y] \approx \frac{y}{n}, \quad
  \text{Var}[\theta|y] \approx \frac{1}{n}\,\frac{y}{n}\left(1-\frac{y}{n}\right).$$
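
The weighted-average form of the posterior mean and the large-$n$ approximation to the posterior variance can be checked with a few lines of R. The sketch below uses the happiness data with the uniform prior; the variable names are ours.

```r
a <- 1; b <- 1        # uniform prior, "prior sample size" a + b = 2
n <- 129; y <- 118    # happiness data

w <- (a + b) / (a + b + n)                       # weight on the prior expectation
post.mean.wavg   <- w * a / (a + b) + (1 - w) * y / n
post.mean.direct <- (a + y) / (a + b + n)

post.var   <- post.mean.direct * (1 - post.mean.direct) / (a + b + n + 1)
approx.var <- (1 / n) * (y / n) * (1 - y / n)    # approximation for n >> a + b

c(w = w, wavg = post.mean.wavg, direct = post.mean.direct)
c(post.var = post.var, approx.var = approx.var)
```

Here the prior receives weight 2/131, so the posterior mean is close to the sample average 118/129.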
Prediction


An important feature of Bayesian inference is the existence of a predictive
distribution for new observations. Reverting for the moment to our notation
for binary data, let $y_1,\ldots,y_n$ be the outcomes from a sample of $n$
binary random variables, and let $\tilde{Y} \in \{0,1\}$ be an additional
outcome from the same population that has yet to be observed. The predictive
distribution of $\tilde{Y}$ is the conditional distribution of $\tilde{Y}$
given $\{Y_1 = y_1,\ldots,Y_n = y_n\}$. For conditionally i.i.d. binary
variables this distribution can be derived from the distribution of
$\tilde{Y}$ given $\theta$ and the posterior distribution of $\theta$:

$$\Pr(\tilde{Y} = 1|y_1,\ldots,y_n) = \int \Pr(\tilde{Y} = 1, \theta|y_1,\ldots,y_n)\,d\theta$$
$$= \int \Pr(\tilde{Y} = 1|\theta, y_1,\ldots,y_n)\,p(\theta|y_1,\ldots,y_n)\,d\theta$$
$$= \int \theta\,p(\theta|y_1,\ldots,y_n)\,d\theta$$
$$= \text{E}[\theta|y_1,\ldots,y_n] = \frac{a + \sum_{i=1}^n y_i}{a+b+n},$$

$$\Pr(\tilde{Y} = 0|y_1,\ldots,y_n) = 1 - \text{E}[\theta|y_1,\ldots,y_n] = \frac{b + \sum_{i=1}^n (1-y_i)}{a+b+n}.$$


You should notice two important things about the predictive distribution:


1. The predictive distribution does not depend on any unknown quantities.
If it did, we would not be able to use it to make predictions.

2. The predictive distribution depends on our observed data. In this
distribution, $\tilde{Y}$ is not independent of $Y_1,\ldots,Y_n$ (recall
Section 2.7). This is because observing $Y_1,\ldots,Y_n$ gives information
about $\theta$, which in turn gives information about $\tilde{Y}$. It would be
bad if $\tilde{Y}$ were independent of $Y_1,\ldots,Y_n$; it would mean that we
could never infer anything about the unsampled population from the sample
cases.

Example


The uniform prior distribution, or beta(1,1) prior, can be thought of as
equivalent to the information in a prior dataset consisting of a single “1”
and a single “0”. Under this prior distribution,

$$\Pr(\tilde{Y} = 1|Y = y) = \text{E}[\theta|Y = y] = \frac{2}{2+n}\,\frac{1}{2} + \frac{n}{2+n}\,\frac{y}{n},$$
$$\text{mode}(\theta|Y = y) = \frac{y}{n},$$

where $Y = \sum_{i=1}^n Y_i$. Does the discrepancy between these two posterior
summaries of our information make sense? Consider the case in which $Y = 0$,
for which $\text{mode}(\theta|Y = 0) = 0$ but
$\Pr(\tilde{Y} = 1|Y = 0) = 1/(2+n)$.
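
The case $Y = 0$ can be explored by simulation: draw $\theta$ from its posterior, then draw $\tilde{Y}$ given $\theta$. This is a sketch only; $n = 10$, the seed and the number of draws are our own illustrative choices.

```r
a <- 1; b <- 1      # uniform prior
n <- 10; y <- 0     # no ones observed in n = 10 trials

post.mode <- (a + y - 1) / (a + b + n - 2)   # equals y / n = 0
pred.one  <- (a + y) / (a + b + n)           # equals 1 / (2 + n)

# Two-stage simulation from the posterior predictive distribution
set.seed(1)
S <- 200000
theta.sim <- rbeta(S, a + y, b + n - y)   # theta | Y = 0
y.tilde   <- rbinom(S, 1, theta.sim)      # Y.tilde | theta

c(mode = post.mode, pred.one = pred.one, mc.pred.one = mean(y.tilde))
```

Although the most probable single value of $\theta$ is 0, the posterior still places mass on positive values, so the predictive probability of a future “1” is positive.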


3.1.2 Confidence regions 


It is often desirable to identify regions of the parameter space that are likely
to contain the true value of the parameter. To do this, after observing the
data $Y = y$ we can construct an interval $[l(y), u(y)]$ such that the
probability that $l(y) < \theta < u(y)$ is large.


Definition 5 (Bayesian coverage) An interval $[l(y), u(y)]$, based on the
observed data $Y = y$, has 95% Bayesian coverage for $\theta$ if

$$\Pr(l(y) < \theta < u(y)|Y = y) = .95.$$

The interpretation of this interval is that it describes your information about
the location of the true value of $\theta$ after you have observed $Y = y$.
This is different from the frequentist interpretation of coverage probability,
which describes the probability that the interval will cover the true value
before the data are observed:


Definition 6 (frequentist coverage) A random interval $[l(Y), u(Y)]$ has
95% frequentist coverage for $\theta$ if, before the data are gathered,

$$\Pr(l(Y) < \theta < u(Y)|\theta) = .95.$$


In a sense, the frequentist and Bayesian notions of coverage describe pre- and 
post-experimental coverage, respectively. 

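You may recall your introductory statistics instructor belaboring the
following point: Once you observe $Y = y$ and you plug this data into your
confidence interval formula $[l(y), u(y)]$, then

$$\Pr(l(y) < \theta < u(y)|\theta) =
  \begin{cases} 0 & \text{if } \theta \notin [l(y), u(y)]; \\
                1 & \text{if } \theta \in [l(y), u(y)]. \end{cases}$$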


This highlights the lack of a post-experimental interpretation of frequentist 
coverage. Although this may make the frequentist interpretation seem some- 
what lacking, it is still useful in many situations. Suppose you are running a 
large number of unrelated experiments and are creating a confidence interval 
for each one of them. If your intervals each have 95% frequentist coverage 
probability, you can expect that 95% of your intervals contain the correct 
parameter value. 

Can a confidence interval have the same Bayesian and frequentist coverage
probability? Hartigan (1966) showed that, for the types of intervals we will
construct in this book, an interval that has 95% Bayesian coverage additionally
has the property that

$$\Pr(l(Y) < \theta < u(Y)|\theta) = .95 + \epsilon_n,$$

where $|\epsilon_n| < a/n$ for some constant $a$. This means that a confidence
interval procedure that gives 95% Bayesian coverage will have approximately 95%
frequentist coverage as well, at least asymptotically. It is important to keep
in mind that most non-Bayesian methods of constructing 95% confidence intervals
also only achieve this coverage rate asymptotically. For more discussion of
the similarities between intervals constructed by Bayesian and non-Bayesian
methods, see Severini (1991) and Sweeting (2001).


Quantile-based interval 


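Perhaps the easiest way to obtain a confidence interval is to use posterior
quantiles. To make a $100 \times (1-\alpha)\%$ quantile-based confidence
interval, find numbers $\theta_{\alpha/2} < \theta_{1-\alpha/2}$ such that

1. $\Pr(\theta < \theta_{\alpha/2}|Y = y) = \alpha/2$;

2. $\Pr(\theta > \theta_{1-\alpha/2}|Y = y) = \alpha/2$.

The numbers $\theta_{\alpha/2}, \theta_{1-\alpha/2}$ are the $\alpha/2$ and
$1-\alpha/2$ posterior quantiles of $\theta$, and so

$$\Pr(\theta \in [\theta_{\alpha/2}, \theta_{1-\alpha/2}]|Y = y)
 = 1 - \Pr(\theta \notin [\theta_{\alpha/2}, \theta_{1-\alpha/2}]|Y = y)$$
$$= 1 - [\Pr(\theta < \theta_{\alpha/2}|Y = y) + \Pr(\theta > \theta_{1-\alpha/2}|Y = y)]$$
$$= 1 - \alpha.$$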

Example: Binomial sampling and uniform prior


Suppose out of $n = 10$ conditionally independent draws of a binary random
variable we observe $Y = 2$ ones. Using a uniform prior distribution for
$\theta$, the posterior distribution is $\theta|\{Y = 2\} \sim$ beta(1+2, 1+8).
A 95% posterior confidence interval can be obtained from the .025 and .975
quantiles of this beta distribution. These quantiles are 0.06 and 0.52
respectively, and so the posterior probability that $\theta \in [0.06, 0.52]$
is 95%.

> a<-1 ; b<-1     # prior
> n<-10 ; y<-2    # data
> qbeta( c(.025,.975), a+y, b+n-y)
[1] 0.06021773 0.51775585
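
For this model, the frequentist coverage of the quantile-based posterior interval can be computed exactly by summing over the possible outcomes of $Y$. The sketch below does this under a uniform prior; coverage() is a hypothetical helper of our own, and the values of $\theta$ and $n$ are arbitrary illustrations of the pre-experimental coverage discussed earlier.

```r
# Hypothetical helper: exact frequentist coverage of the 95% quantile-based
# posterior interval under a beta(a, b) prior, for a fixed theta and n
coverage <- function(theta, n, a = 1, b = 1) {
  y <- 0:n                                   # every possible outcome of Y
  lower <- qbeta(0.025, a + y, b + n - y)    # l(y) for each outcome
  upper <- qbeta(0.975, a + y, b + n - y)    # u(y) for each outcome
  covers <- (lower < theta) & (theta < upper)
  sum(dbinom(y, n, theta) * covers)          # Pr(l(Y) < theta < u(Y) | theta)
}

cov.small <- coverage(0.3, n = 10)
cov.large <- coverage(0.3, n = 1000)
c(n10 = cov.small, n1000 = cov.large)
```

Because $Y$ is discrete, the coverage is not exactly 95% for any fixed $n$, but it should be close to 95% when $n$ is large.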


Highest posterior density (HPD) region 


Figure 3.5 shows the posterior distribution and a 95% confidence interval for
$\theta$ from the previous example. Notice that there are $\theta$-values
outside the quantile-based interval that have higher probability (density) than
some points inside the interval. This suggests a more restrictive type of
interval:


Fig. 3.5. A beta posterior distribution, with vertical bars indicating a 95%
quantile-based confidence interval.


Definition 7 (HPD region) A $100 \times (1-\alpha)\%$ HPD region consists of a
subset of the parameter space, $s(y) \subset \Theta$, such that

1. $\Pr(\theta \in s(y)|Y = y) = 1-\alpha$;

2. If $\theta_a \in s(y)$ and $\theta_b \notin s(y)$, then
$p(\theta_a|Y = y) > p(\theta_b|Y = y)$.


All points in an HPD region have a higher posterior density than points
outside the region. However, an HPD region might not be an interval if the
posterior density is multimodal (having multiple peaks). Figure 3.6 gives the
basic idea behind the construction of an HPD region: Gradually move a
horizontal line down across the density, including in the HPD region all
$\theta$-values having a density above the horizontal line. Stop moving the
line down when the posterior probability of the $\theta$-values in the region
reaches $(1-\alpha)$. For the binomial example above, the 95% HPD region is
[0.04, 0.48], which is narrower (more precise) than the quantile-based
interval, yet both contain 95% of the posterior probability.
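
For a unimodal posterior such as beta(3, 9), the HPD region is the shortest interval with the required probability content, so it can be found by a one-dimensional search. The sketch below is one way to do this and is not necessarily how the interval in the text was computed: among all intervals $[q_p, q_{p+0.95}]$ defined by posterior quantiles, we pick the lower tail probability $p$ that minimizes the width.

```r
a <- 3; b <- 9        # posterior from the example: beta(1 + 2, 1 + 8)
level <- 0.95

# Width of the interval with lower tail probability p and content `level`
width <- function(p) qbeta(pmin(p + level, 1), a, b) - qbeta(p, a, b)

# Search over p in (0, 1 - level); valid only for unimodal densities
p.opt <- optimize(width, interval = c(0, 1 - level), tol = 1e-10)$minimum

hpd   <- qbeta(c(p.opt, p.opt + level), a, b)
quant <- qbeta(c(0.025, 0.975), a, b)

rbind(hpd = hpd, quantile = quant)
```

Both intervals contain 95% of the posterior probability, and the HPD interval should be the narrower of the two.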


3.2 The Poisson model 


Some measurements, such as a person’s number of children or number of
friends, have values that are whole numbers. In these cases our sample space
is $\mathcal{Y} = \{0,1,2,\ldots\}$. Perhaps the simplest probability model on
$\mathcal{Y}$ is the Poisson model.


Poisson distribution 


Recall from Chapter 2 that a random variable $Y$ has a Poisson distribution
with mean $\theta$ if

$$\Pr(Y = y|\theta) = \text{dpois}(y, \theta) = \theta^y e^{-\theta}/y! \quad \text{for } y \in \{0,1,2,\ldots\}.$$

For such a random variable,

• $\text{E}[Y|\theta] = \theta$;
• $\text{Var}[Y|\theta] = \theta$.

People sometimes say that the Poisson family of distributions has a
“mean-variance relationship” because if one Poisson distribution has a larger
mean than another, it will have a larger variance as well.


Fig. 3.6. Highest posterior density regions of varying probability content. The
dashed line is the 95% quantile-based interval.

Fig. 3.7. Poisson distributions. The first panel shows a Poisson distribution
with mean of 1.83, along with the empirical distribution of the number of
children of women of age 40 from the GSS during the 1990s. The second panel
shows the distribution of the sum of 10 i.i.d. Poisson random variables with
mean 1.83. This is the same as a Poisson distribution with mean 18.3.
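3.2.1 Posterior inference


If we model $Y_1,\ldots,Y_n$ as i.i.d. Poisson with mean $\theta$, then the
joint pdf of our sample data is as follows:

$$\Pr(Y_1 = y_1,\ldots,Y_n = y_n|\theta) = \prod_{i=1}^n p(y_i|\theta)
 = \prod_{i=1}^n \frac{1}{y_i!}\,\theta^{y_i}e^{-\theta}
 = c(y_1,\ldots,y_n)\,\theta^{\sum y_i}e^{-n\theta}.$$

Comparing two values of $\theta$ a posteriori, we have

$$\frac{p(\theta_a|y_1,\ldots,y_n)}{p(\theta_b|y_1,\ldots,y_n)}
 = \frac{c(y_1,\ldots,y_n)\,e^{-n\theta_a}\theta_a^{\sum y_i}\,p(\theta_a)}
        {c(y_1,\ldots,y_n)\,e^{-n\theta_b}\theta_b^{\sum y_i}\,p(\theta_b)}
 = \frac{e^{-n\theta_a}}{e^{-n\theta_b}}\,
   \frac{\theta_a^{\sum y_i}}{\theta_b^{\sum y_i}}\,
   \frac{p(\theta_a)}{p(\theta_b)}.$$

As in the case of the i.i.d. binary model, $\sum_{i=1}^n Y_i$ contains all the
information about $\theta$ that is available in the data, and again we say that
$\sum_{i=1}^n Y_i$ is a sufficient statistic. Furthermore,
$\{\sum_{i=1}^n Y_i|\theta\} \sim$ Poisson$(n\theta)$.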


Conjugate prior 


For now we will work with a class of conjugate prior distributions that will
make posterior calculations simple. Recall that a class of prior densities is
conjugate for a sampling model $p(y_1,\ldots,y_n|\theta)$ if the posterior
distribution is also in the class. For the Poisson sampling model, our
posterior distribution for $\theta$ has the following form:

$$p(\theta|y_1,\ldots,y_n) \propto p(\theta) \times p(y_1,\ldots,y_n|\theta)
 \propto p(\theta) \times \theta^{\sum y_i}e^{-n\theta}.$$

This means that whatever our conjugate class of densities is, it will have
to include terms like $\theta^{c_1}e^{-c_2\theta}$ for numbers $c_1$ and $c_2$.
The simplest class of such densities includes only these terms, and their
corresponding probability distributions are known as the family of gamma
distributions.


Gamma distribution 


An uncertain positive quantity $\theta$ has a gamma$(a, b)$ distribution if

$$p(\theta) = \text{dgamma}(\theta, a, b) = \frac{b^a}{\Gamma(a)}\,\theta^{a-1}e^{-b\theta}, \quad \text{for } \theta, a, b > 0.$$

For such a random variable,

• $\text{E}[\theta] = a/b$;
• $\text{Var}[\theta] = a/b^2$;
• $\text{mode}[\theta] = (a-1)/b$ if $a > 1$, and $0$ if $a \le 1$.


Fig. 3.8. Gamma densities.


Posterior distribution of $\theta$


Suppose $Y_1,\ldots,Y_n|\theta \sim$ i.i.d. Poisson($\theta$) and
$p(\theta) = \text{dgamma}(\theta, a, b)$. Then

$$p(\theta|y_1,\ldots,y_n) = p(\theta) \times p(y_1,\ldots,y_n|\theta)/p(y_1,\ldots,y_n)$$
$$= \{\theta^{a-1}e^{-b\theta}\} \times \{\theta^{\sum y_i}e^{-n\theta}\} \times c(y_1,\ldots,y_n, a, b)$$
$$= \{\theta^{a+\sum y_i-1}e^{-(b+n)\theta}\} \times c(y_1,\ldots,y_n, a, b).$$

This is evidently a gamma distribution, and we have confirmed the conjugacy
of the gamma family for the Poisson sampling model:

if $\theta \sim$ gamma$(a, b)$ and $Y_1,\ldots,Y_n|\theta \sim$ Poisson($\theta$),
then $\{\theta|Y_1,\ldots,Y_n\} \sim$ gamma$(a + \sum_{i=1}^n Y_i,\ b+n)$.

Estimation and prediction proceed in a manner similar to that in the binomial
model. The posterior expectation of $\theta$ is a convex combination of the
prior expectation and the sample average:

$$\text{E}[\theta|y_1,\ldots,y_n] = \frac{a + \sum y_i}{b+n}
 = \frac{b}{b+n}\,\frac{a}{b} + \frac{n}{b+n}\,\frac{\sum y_i}{n}.$$

• $b$ is interpreted as the number of prior observations;
• $a$ is interpreted as the sum of counts from $b$ prior observations.


For large $n$, the information from the data dominates the prior information:

$$n \gg b \ \Rightarrow\ \text{E}[\theta|y_1,\ldots,y_n] \approx \bar{y}, \quad \text{Var}[\theta|y_1,\ldots,y_n] \approx \bar{y}/n.$$
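
The gamma update and the convex-combination form of the posterior mean can be written out in a few lines. The sketch below uses a gamma(2, 1) prior with $n = 44$ observations summing to 66; these values are chosen for illustration, and the variable names are ours.

```r
a <- 2; b <- 1        # gamma(a, b) prior: b "prior observations" with sum a
n <- 44; sy <- 66     # n observations with sum sy

post.a <- a + sy      # posterior shape
post.b <- b + n       # posterior rate

w <- b / (b + n)                                  # weight on the prior mean
post.mean.wavg <- w * (a / b) + (1 - w) * (sy / n)
post.mean      <- post.a / post.b

post.var   <- post.a / post.b^2
approx.var <- (sy / n) / n                        # ybar / n, for n >> b

c(wavg = post.mean.wavg, direct = post.mean)
c(post.var = post.var, approx.var = approx.var)
```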


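Predictions about additional data can be obtained with the posterior
predictive distribution:

$$p(\tilde{y}|y_1,\ldots,y_n) = \int_0^\infty p(\tilde{y}|\theta, y_1,\ldots,y_n)\,p(\theta|y_1,\ldots,y_n)\,d\theta$$
$$= \int p(\tilde{y}|\theta)\,p(\theta|y_1,\ldots,y_n)\,d\theta$$
$$= \int \text{dpois}(\tilde{y}, \theta)\,\text{dgamma}(\theta, a + \textstyle\sum y_i, b+n)\,d\theta$$
$$= \int \left\{\frac{1}{\tilde{y}!}\,\theta^{\tilde{y}}e^{-\theta}\right\}
         \left\{\frac{(b+n)^{a+\sum y_i}}{\Gamma(a+\sum y_i)}\,\theta^{a+\sum y_i-1}e^{-(b+n)\theta}\right\} d\theta$$
$$= \frac{(b+n)^{a+\sum y_i}}{\Gamma(\tilde{y}+1)\Gamma(a+\sum y_i)}
    \int_0^\infty \theta^{a+\sum y_i+\tilde{y}-1}e^{-(b+n+1)\theta}\,d\theta.$$

Evaluation of this complicated integral looks daunting, but it turns out that
it can be done without any additional calculus. Let’s use what we know about
the gamma density:

$$1 = \int_0^\infty \frac{b^a}{\Gamma(a)}\,\theta^{a-1}e^{-b\theta}\,d\theta \quad \text{for any values } a, b > 0.$$

This means that

$$\int_0^\infty \theta^{a-1}e^{-b\theta}\,d\theta = \frac{\Gamma(a)}{b^a} \quad \text{for any values } a, b > 0.$$

Now substitute in $a + \sum y_i + \tilde{y}$ instead of $a$ and $b+n+1$ instead
of $b$ to get

$$\int_0^\infty \theta^{a+\sum y_i+\tilde{y}-1}e^{-(b+n+1)\theta}\,d\theta
 = \frac{\Gamma(a+\sum y_i+\tilde{y})}{(b+n+1)^{a+\sum y_i+\tilde{y}}}.$$

After simplifying some of the algebra, this gives

$$p(\tilde{y}|y_1,\ldots,y_n) = \frac{\Gamma(a+\sum y_i+\tilde{y})}{\Gamma(\tilde{y}+1)\Gamma(a+\sum y_i)}
  \left(\frac{b+n}{b+n+1}\right)^{a+\sum y_i}
  \left(\frac{1}{b+n+1}\right)^{\tilde{y}}$$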
for $\tilde{y} \in \{0,1,2,\ldots\}$. This is a negative binomial distribution
with parameters $(a + \sum y_i,\ b+n)$, for which

$$\text{E}[\tilde{Y}|y_1,\ldots,y_n] = \frac{a+\sum y_i}{b+n} = \text{E}[\theta|y_1,\ldots,y_n];$$
$$\text{Var}[\tilde{Y}|y_1,\ldots,y_n] = \frac{a+\sum y_i}{b+n}\,\frac{b+n+1}{b+n}
 = \text{Var}[\theta|y_1,\ldots,y_n] \times (b+n+1)
 = \text{E}[\theta|y_1,\ldots,y_n] \times \frac{b+n+1}{b+n}.$$

Let’s try to obtain a deeper understanding of this formula for the predictive
variance. Recall, the predictive variance is to some extent a measure of our
posterior uncertainty about a new sample $\tilde{Y}$ from the population.
Uncertainty about $\tilde{Y}$ stems from uncertainty about the population and
the variability in sampling from the population. For large $n$, uncertainty
about $\theta$ is small ($(b+n+1)/(b+n) \approx 1$) and uncertainty about
$\tilde{Y}$ stems primarily from sampling variability, which for the Poisson
model is equal to $\theta$. For small $n$, uncertainty in $\tilde{Y}$ also
includes the uncertainty in $\theta$, and so the total uncertainty is larger
than just the sampling variability ($(b+n+1)/(b+n) > 1$).
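
The negative binomial form of the predictive distribution can be checked against a direct numerical evaluation of the integral of dpois() times dgamma(). The sketch below uses a midpoint rule on a fine grid; the grid spacing, the upper limit of 6, and the prior and data values are our own illustrative choices.

```r
a <- 2; b <- 1; n <- 44; sy <- 66    # illustrative prior and data
size <- a + sy
mu   <- (a + sy) / (b + n)

# Closed form: negative binomial with these `size` and `mu` arguments
y.tilde     <- 0:10
pred.closed <- dnbinom(y.tilde, size = size, mu = mu)

# Direct evaluation of the predictive integral by the midpoint rule
h <- 0.001
theta <- seq(h / 2, 6, by = h)       # posterior mass above 6 is negligible
pred.grid <- sapply(y.tilde, function(yt)
  sum(dpois(yt, theta) * dgamma(theta, a + sy, b + n)) * h)

# Predictive mean and variance by summation over a long range of counts
yy <- 0:200
p  <- dnbinom(yy, size = size, mu = mu)
pred.mean <- sum(yy * p)
pred.var  <- sum((yy - pred.mean)^2 * p)

max(abs(pred.closed - pred.grid))
c(mean = pred.mean, var = pred.var)
```

In R's parameterization, `mu = size * (1 - prob) / prob`, so these arguments correspond to `prob = (b + n) / (b + n + 1)`, matching the formula above.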


3.2.2 Example: Birth rates 


Fig. 3.9. Numbers of children for the two groups (panels: less than
bachelor’s; bachelor’s or higher).


Over the course of the 1990s the General Social Survey gathered data on
the educational attainment and number of children of 155 women who were 40
years of age at the time of their participation in the survey. These women were
in their 20s during the 1970s, a period of historically low fertility rates in
the United States. In this example we will compare the women with college
degrees to those without in terms of their numbers of children. Let
$Y_{1,1},\ldots,Y_{n_1,1}$ denote the numbers of children for the $n_1$ women
without college degrees and $Y_{1,2},\ldots,Y_{n_2,2}$ be the data for women
with degrees. For this example, we will use the following sampling models:

$$Y_{1,1},\ldots,Y_{n_1,1}|\theta_1 \sim \text{i.i.d. Poisson}(\theta_1)$$
$$Y_{1,2},\ldots,Y_{n_2,2}|\theta_2 \sim \text{i.i.d. Poisson}(\theta_2)$$

The appropriateness of the Poisson model for these data will be examined in
the next chapter.

Empirical distributions for the data are displayed in Figure 3.9, and group
sums and means are as follows:

Less than bachelor’s: $n_1 = 111$, $\sum_{i=1}^{n_1} Y_{i,1} = 217$, $\bar{Y}_1 = 1.95$
Bachelor’s or higher: $n_2 = 44$, $\sum_{i=1}^{n_2} Y_{i,2} = 66$, $\bar{Y}_2 = 1.50$

In the case where $\{\theta_1, \theta_2\} \sim$ i.i.d. gamma($a = 2$, $b = 1$),
we have the following posterior distributions:

$$\theta_1|\{n_1 = 111, \textstyle\sum Y_{i,1} = 217\} \sim \text{gamma}(2 + 217, 1 + 111) = \text{gamma}(219, 112)$$
$$\theta_2|\{n_2 = 44, \textstyle\sum Y_{i,2} = 66\} \sim \text{gamma}(2 + 66, 1 + 44) = \text{gamma}(68, 45)$$


Posterior means, modes and 95% quantile-based confidence intervals for
$\theta_1$ and $\theta_2$ can be obtained from their gamma posterior
distributions:

> a<-2 ; b<-1            # prior parameters
> n1<-111 ; sy1<-217     # data in group 1
> n2<-44 ; sy2<-66       # data in group 2

> (a+sy1)/(b+n1)         # posterior mean
[1] 1.955357
> (a+sy1-1)/(b+n1)       # posterior mode
[1] 1.946429
> qgamma( c(.025,.975), a+sy1, b+n1)    # posterior 95% CI
[1] 1.704943 2.222679

> (a+sy2)/(b+n2)
[1] 1.511111
> (a+sy2-1)/(b+n2)
[1] 1.488889
> qgamma( c(.025,.975), a+sy2, b+n2)
[1] 1.173437 1.890836


Posterior densities for the population means of the two groups are shown in
the first panel of Figure 3.10. The posterior indicates substantial evidence
that $\theta_1 > \theta_2$. For example,
$\Pr(\theta_1 > \theta_2|\sum Y_{i,1} = 217, \sum Y_{i,2} = 66) = 0.97$.
Now consider two randomly sampled individuals, one from each of the two
populations. To what extent do we expect the one without the bachelor’s
degree to have more children than the other? We can calculate the relevant
probabilities exactly: The posterior predictive distributions for
$\tilde{Y}_1$ and $\tilde{Y}_2$ are both negative binomial distributions and
are plotted in the second panel of Figure 3.10.


Fig. 3.10. Posterior distributions of mean birth rates (with the common prior
distribution given by the dashed line), and posterior predictive distributions
for number of children.


> y<- 0:10

> dnbinom(y, size=(a+sy1), mu=(a+sy1)/(b+n1))
[1] 1.427473e-01 2.766518e-01 2.693071e-01 1.755660e-01
[5] 8.622930e-02 3.403387e-02 1.124423e-02 3.198421e-03
[9] 7.996053e-04 1.784763e-04 3.601115e-05

> dnbinom(y, size=(a+sy2), mu=(a+sy2)/(b+n2))
[1] 2.243460e-01 3.316420e-01 2.487315e-01 1.261681e-01
[5] 4.868444e-02 1.524035e-02 4.030961e-03 9.263700e-04
[9] 1.887982e-04 3.465861e-05 5.801551e-06


Notice that there is much more overlap between these two distributions than
between the posterior distributions of $\theta_1$ and $\theta_2$. For example,
$\Pr(\tilde{Y}_1 > \tilde{Y}_2|\sum Y_{i,1} = 217, \sum Y_{i,2} = 66) = .48$ and
$\Pr(\tilde{Y}_1 = \tilde{Y}_2|\sum Y_{i,1} = 217, \sum Y_{i,2} = 66) = .22$.
The distinction between the events $\{\theta_1 > \theta_2\}$ and
$\{\tilde{Y}_1 > \tilde{Y}_2\}$ is extremely important: Strong evidence of a
difference between two populations does not mean that the difference itself is
large.
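
One way to compute these two predictive probabilities is to form the joint distribution of $(\tilde{Y}_1, \tilde{Y}_2)$ on a grid of counts. The sketch below assumes, as in the model above, that the two predictive distributions are independent given the data (because $\theta_1$ and $\theta_2$ are), and truncates the counts at 100, beyond which the probability mass is negligible. The truncation point and variable names are our own.

```r
a <- 2; b <- 1                  # prior parameters
n1 <- 111; sy1 <- 217           # data in group 1
n2 <- 44;  sy2 <- 66            # data in group 2

y  <- 0:100                     # truncated range of possible counts
p1 <- dnbinom(y, size = a + sy1, mu = (a + sy1) / (b + n1))
p2 <- dnbinom(y, size = a + sy2, mu = (a + sy2) / (b + n2))

# joint[i, j] = Pr(Y1.tilde = y[i]) * Pr(Y2.tilde = y[j])
joint <- outer(p1, p2)

pr.greater <- sum(joint[outer(y, y, ">")])   # Pr(Y1.tilde > Y2.tilde | data)
pr.equal   <- sum(diag(joint))               # Pr(Y1.tilde = Y2.tilde | data)

round(c(greater = pr.greater, equal = pr.equal), 2)
```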
3.3 Exponential families and conjugate priors


The binomial and Poisson models discussed in this chapter are both instances
of one-parameter exponential family models. A one-parameter exponential family
model is any model whose densities can be expressed as
$p(y|\phi) = h(y)c(\phi)e^{\phi t(y)}$, where $\phi$ is the unknown parameter
and $t(y)$ is the sufficient statistic. Diaconis and Ylvisaker (1979) study
conjugate prior distributions for general exponential family models, and in
particular prior distributions of the form
$p(\phi|n_0, t_0) = \kappa(n_0, t_0)c(\phi)^{n_0}e^{n_0 t_0 \phi}$. Combining
such prior information with information from
$Y_1,\ldots,Y_n \sim$ i.i.d. $p(y|\phi)$ gives the following posterior
distribution:

$$p(\phi|y_1,\ldots,y_n) \propto p(\phi)\,p(y_1,\ldots,y_n|\phi)$$
$$\propto c(\phi)^{n_0+n}\exp\left\{\phi \times \left[n_0 t_0 + \sum_{i=1}^n t(y_i)\right]\right\}$$
$$\propto p(\phi|n_0 + n,\ n_0 t_0 + n\bar{t}(y)),$$

where $\bar{t}(y) = \sum t(y_i)/n$. The similarity between the posterior and
prior distributions suggests that $n_0$ can be interpreted as a “prior sample
size” and $t_0$ as a “prior guess” of $t(Y)$. This interpretation can be made a
bit more precise: Diaconis and Ylvisaker (1979) show that

$$\text{E}[t(Y)] = \text{E}[\,\text{E}[t(Y)|\phi]\,] = \text{E}[-c'(\phi)/c(\phi)] = t_0$$

(see also Exercise 3.6), so $t_0$ represents the prior expected value of
$t(Y)$. The parameter $n_0$ is a measure of how informative the prior is. There
are a variety of ways of quantifying this, but perhaps the simplest is to note
that, as a function of $\phi$, $p(\phi|n_0, t_0)$ has the same shape as a
likelihood $p(\tilde{y}_1,\ldots,\tilde{y}_{n_0}|\phi)$ based on $n_0$ “prior
observations” $\tilde{y}_1,\ldots,\tilde{y}_{n_0}$ for which
$\sum t(\tilde{y}_i)/n_0 = t_0$. In this sense the prior distribution
$p(\phi|n_0, t_0)$ contains the same amount of information that would be
obtained from $n_0$ independent samples from the population.


Example: Binomial model 


The exponential family representation of the binomial($\theta$) model can be
obtained from the density function for a single binary random variable:

$$p(y|\theta) = \theta^y(1-\theta)^{1-y}
 = \left(\frac{\theta}{1-\theta}\right)^y(1-\theta)
 = e^{\phi y}(1 + e^{\phi})^{-1},$$

where $\phi = \log[\theta/(1-\theta)]$ is the log-odds. The conjugate prior for
$\phi$ is thus given by
$p(\phi|n_0, t_0) \propto (1 + e^{\phi})^{-n_0}e^{n_0 t_0 \phi}$, where $t_0$
represents the prior expectation of $t(y) = y$, or equivalently, $t_0$
represents our prior probability that $Y = 1$. Using the change of variables
formula (Exercise 3.10), this translates into a prior distribution for $\theta$
such that
$p(\theta|n_0, t_0) \propto \theta^{n_0 t_0 - 1}(1-\theta)^{n_0(1-t_0)-1}$,
which is a beta$(n_0 t_0,\ n_0(1-t_0))$ distribution. A weakly informative
prior distribution can be obtained by setting $t_0$ equal to our prior
expectation and $n_0 = 1$. If our prior expectation is 1/2, the resulting prior
is a beta(1/2, 1/2) distribution, which is equivalent to Jeffreys’ prior
distribution (Exercise 3.11) for the binomial sampling model. Under the weakly
informative beta$(t_0, (1-t_0))$ prior distribution, the posterior would be
$\{\theta|y_1,\ldots,y_n\} \sim$ beta$(t_0 + \sum y_i,\ (1-t_0) + \sum(1-y_i))$.


Example: Poisson model 


The Poisson($\theta$) model can be shown to be an exponential family model with

• $t(y) = y$;
• $\phi = \log\theta$;
• $c(\phi) = \exp(-e^{\phi})$.

The conjugate prior distribution for $\phi$ is thus
$p(\phi|n_0, t_0) \propto \exp(-n_0 e^{\phi})e^{n_0 t_0 \phi}$, where $t_0$ is
the prior expectation of the population mean of $Y$. This translates into a
prior density for $\theta$ of the form
$p(\theta|n_0, t_0) \propto \theta^{n_0 t_0 - 1}e^{-n_0\theta}$, which is a
gamma$(n_0 t_0, n_0)$ density. A weakly informative prior distribution can be
obtained with $t_0$ set to the prior expectation of $Y$ and $n_0 = 1$, giving a
gamma$(t_0, 1)$ prior distribution. The posterior distribution under such a
prior would be $\{\theta|y_1,\ldots,y_n\} \sim$ gamma$(t_0 + \sum y_i,\ 1+n)$.


3.4 Discussion and further references 


The notion of conjugacy for classes of prior distributions was developed in 
Raiffa and Schlaifer (1961). Important results on conjugacy for exponential 
families appear in Diaconis and Ylvisaker (1979) and Diaconis and Ylvisaker 
(1985). The latter shows that any prior distribution may be approximated by 
a mixture of conjugate priors. 

Most authors refer to intervals of high posterior probability as “credible 
intervals” as opposed to confidence intervals. Doing so fails to recognize that 
Bayesian intervals do have frequentist coverage probabilities, often being very 
close to the specified Bayesian coverage level (Welch and Peers, 1963; Har- 
tigan, 1966; Severini, 1991). Some authors suggest that accurate frequentist 
coverage can be a guide for the construction of prior distributions (Tibshirani, 
1989; Sweeting, 1999, 2001). See also Kass and Wasserman (1996) for a review 
of formal methods for selecting prior distributions. 


4


Monte Carlo approximation 


In the last chapter we saw examples in which a conjugate prior distribution for
an unknown parameter $\theta$ led to a posterior distribution for which there
were simple formulae for posterior means and variances. However, often we will
want to summarize other aspects of a posterior distribution. For example, we
may want to calculate $\Pr(\theta \in A|y_1,\ldots,y_n)$ for arbitrary sets
$A$. Alternatively, we may be interested in means and standard deviations of
some function of $\theta$, or the predictive distribution of missing or
unobserved data. When comparing two or more populations we may be interested in
the posterior distribution of $|\theta_1 - \theta_2|$, $\theta_1/\theta_2$, or
$\max\{\theta_1,\ldots,\theta_m\}$, all of which are functions of more than one
parameter. Obtaining exact values for these posterior quantities can be
difficult or impossible, but if we can generate random sample values of the
parameters from their posterior distributions, then all of these posterior
quantities of interest can be approximated to an arbitrary degree of precision
using the Monte Carlo method.


4.1 The Monte Carlo method 


In the last chapter we obtained the following posterior distributions for
birthrates of women without and with bachelor’s degrees, respectively:

$$p(\theta_1|\textstyle\sum_{i=1}^{111} Y_{i,1} = 217) = \text{dgamma}(\theta_1, 219, 112)$$
$$p(\theta_2|\textstyle\sum_{i=1}^{44} Y_{i,2} = 66) = \text{dgamma}(\theta_2, 68, 45)$$

Additionally, we modeled $\theta_1$ and $\theta_2$ as conditionally independent
given the data. It was claimed that
$\Pr(\theta_1 > \theta_2|\sum Y_{i,1} = 217, \sum Y_{i,2} = 66) = 0.97$. How
was this probability calculated? From Chapter 2, we have


P.D. Hoff, A First Course in Bayesian Statistical Methods, 
Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_4, 
© Springer Science+Business Media, LLC 2009 
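$$\Pr(\theta_1 > \theta_2|y_{1,1},\ldots,y_{n_2,2})
 = \int_0^\infty\!\int_0^{\theta_1} p(\theta_1, \theta_2|y_{1,1},\ldots,y_{n_2,2})\,d\theta_2\,d\theta_1$$
$$= \int_0^\infty\!\int_0^{\theta_1} \text{dgamma}(\theta_1, 219, 112) \times \text{dgamma}(\theta_2, 68, 45)\,d\theta_2\,d\theta_1$$
$$= \frac{112^{219}\,45^{68}}{\Gamma(219)\Gamma(68)} \int_0^\infty\!\int_0^{\theta_1} \theta_1^{218}\theta_2^{67}e^{-112\theta_1 - 45\theta_2}\,d\theta_2\,d\theta_1.$$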


There are a variety of ways to calculate this integral. It can be done with 
pencil and paper using results from calculus, and it can be calculated nu- 
merically in many mathematical software packages. However, the feasibility 
of these integration methods depends heavily on the particular details of this 
model, prior distribution and the probability statement that we are trying to 
calculate. As an alternative, in this text we will use an integration method for 
which the general principles and procedures remain relatively constant across 
a broad class of problems. The method, known as Monte Carlo approxima- 
tion, is based on random sampling and its implementation does not require a 
deep knowledge of calculus or numerical analysis. 

Let $\theta$ be a parameter of interest and let $y_1,\ldots,y_n$ be the
numerical values of a sample from a distribution $p(y_1,\ldots,y_n|\theta)$.
Suppose we could sample some number $S$ of independent, random $\theta$-values
from the posterior distribution $p(\theta|y_1,\ldots,y_n)$:

$$\theta^{(1)},\ldots,\theta^{(S)} \sim \text{i.i.d. } p(\theta|y_1,\ldots,y_n).$$

Then the empirical distribution of the samples
$\{\theta^{(1)},\ldots,\theta^{(S)}\}$ would approximate
$p(\theta|y_1,\ldots,y_n)$, with the approximation improving with increasing
$S$. The empirical distribution of $\{\theta^{(1)},\ldots,\theta^{(S)}\}$ is
known as a Monte Carlo approximation to $p(\theta|y_1,\ldots,y_n)$. Many
computer languages and computing environments have procedures for simulating
this sampling process. For example, R has built-in functions to simulate i.i.d.
samples from most of the distributions we will use in this book.

Figure 4.1 shows successive Monte Carlo approximations to the density
of the gamma(68,45) distribution, along with the true density function for
comparison. As we see, the empirical distribution of the Monte Carlo samples
provides an increasingly close approximation to the true density as $S$ gets
larger. Additionally, let $g(\theta)$ be (just about) any function. The law of
large numbers says that if $\theta^{(1)},\ldots,\theta^{(S)}$ are i.i.d.
samples from $p(\theta|y_1,\ldots,y_n)$, then

$$\frac{1}{S}\sum_{s=1}^S g(\theta^{(s)}) \rightarrow \text{E}[g(\theta)|y_1,\ldots,y_n]
 = \int g(\theta)\,p(\theta|y_1,\ldots,y_n)\,d\theta \quad \text{as } S \rightarrow \infty.$$

This implies that as $S \rightarrow \infty$,

• $\bar{\theta} = \sum_{s=1}^S \theta^{(s)}/S \rightarrow \text{E}[\theta|y_1,\ldots,y_n]$;
• $\sum_{s=1}^S (\theta^{(s)} - \bar{\theta})^2/(S-1) \rightarrow \text{Var}[\theta|y_1,\ldots,y_n]$;
• $\#(\theta^{(s)} \le c)/S \rightarrow \Pr(\theta \le c|y_1,\ldots,y_n)$;
• the empirical distribution of $\{\theta^{(1)},\ldots,\theta^{(S)}\} \rightarrow p(\theta|y_1,\ldots,y_n)$;
• the median of $\{\theta^{(1)},\ldots,\theta^{(S)}\} \rightarrow \theta_{1/2}$;
• the $\alpha$-percentile of $\{\theta^{(1)},\ldots,\theta^{(S)}\} \rightarrow \theta_\alpha$.


Fig. 4.1. Histograms and kernel density estimates of Monte Carlo approximations
to the gamma(68,45) distribution, with the true density in gray.


Just about any aspect of a posterior distribution we may be interested in can
be approximated to arbitrary precision with a large enough Monte Carlo sample.

Numerical evaluation 


We will first gain some familiarity and confidence with the Monte Carlo
procedure by comparing its approximations to a few posterior quantities that
we can compute exactly (or nearly so) by other methods. Suppose we model
$Y_1,\ldots,Y_n|\theta$ as i.i.d. Poisson($\theta$), and have a gamma($a,b$) prior
distribution for $\theta$. Having observed $Y_1 = y_1,\ldots,Y_n = y_n$, the posterior
distribution is gamma($a + \sum y_i$, $b + n$). For the college-educated population
in the birthrate example, $(a = 2, b = 1)$ and $(\sum y_i = 66, n = 44)$.


Expectation: The posterior mean is $(a + \sum y_i)/(b + n) = 68/45 = 1.51$. Monte
Carlo approximations to this for $S \in \{10, 100, 1000\}$ can be obtained in R
as follows:


a<-2 ; b<-1
sy<-66 ; n<-44

theta.mc10<-rgamma(10,a+sy,b+n)
theta.mc100<-rgamma(100,a+sy,b+n)
theta.mc1000<-rgamma(1000,a+sy,b+n)

mean(theta.mc10)
mean(theta.mc100)
mean(theta.mc1000)


Results will vary depending on the seed of the random number generator.


Probabilities: The posterior probability that $\{\theta < 1.75\}$ can be obtained to
a high degree of precision in R with the command pgamma(1.75,a+sy,b+n),
which yields 0.8998. Using the simulated values of $\theta$ from above, the
corresponding Monte Carlo approximations were:


> mean(theta.mc10<1.75)
[1] 0.9
> mean(theta.mc100<1.75)
[1] 0.94
> mean(theta.mc1000<1.75)
[1] 0.899

Quantiles: A 95% quantile-based confidence region can be obtained with
qgamma(c(.025,.975),a+sy,b+n), giving an interval of (1.173,1.891).
Approximate 95% confidence regions can also be obtained from the Monte
Carlo samples:


> quantile(theta.mc10, c(.025,.975))
    2.5%    97.5%
1.260291 1.750068
> quantile(theta.mc100, c(.025,.975))
    2.5%    97.5%
1.231646 1.813752
> quantile(theta.mc1000, c(.025,.975))
    2.5%    97.5%
1.180194 1.892473


Figure 4.2 shows the convergence of the Monte Carlo estimates to the correct
values graphically, based on cumulative estimates from a sequence of
$S = 1000$ samples from the gamma(68,45) distribution. Such plots can help
indicate when enough Monte Carlo samples have been made.

Fig. 4.2. Estimates of the posterior mean, $\Pr(\theta < 1.75|y_1,\ldots,y_n)$ and the
97.5% posterior quantile as a function of the number of Monte Carlo samples.
Horizontal gray lines are the true values.

Additionally, Monte Carlo standard errors can be obtained to assess the
accuracy of approximations to posterior means: Letting
$\bar{\theta} = \sum_{s=1}^{S} \theta^{(s)}/S$ be the sample mean of the Monte Carlo
samples, the Central Limit Theorem says that $\bar{\theta}$ is approximately normally
distributed with expectation $E[\theta|y_1,\ldots,y_n]$ and standard deviation equal
to $\sqrt{\text{Var}[\theta|y_1,\ldots,y_n]/S}$. The Monte Carlo standard error is the
approximation to this standard deviation: Letting
$\hat{\sigma}^2 = \sum_{s=1}^{S} (\theta^{(s)} - \bar{\theta})^2/(S-1)$ be the Monte Carlo
estimate of $\text{Var}[\theta|y_1,\ldots,y_n]$, the Monte Carlo standard error is
$\sqrt{\hat{\sigma}^2/S}$. An approximate 95% Monte Carlo confidence interval for the
posterior mean of $\theta$ is $\bar{\theta} \pm 2\sqrt{\hat{\sigma}^2/S}$. Standard practice is
to choose $S$ to be large enough so that the Monte Carlo standard error is less
than the precision to which you want to report $E[\theta|y_1,\ldots,y_n]$. For
example, suppose you had generated a Monte Carlo sample of size $S = 100$ for
which the estimate of $\text{Var}[\theta|y_1,\ldots,y_n]$ was 0.024. The approximate
Monte Carlo standard error would then be $\sqrt{0.024/100} = 0.015$. If you
wanted the difference between $E[\theta|y_1,\ldots,y_n]$ and its Monte Carlo estimate
to be less than 0.01 with high probability, you would need to increase your
Monte Carlo sample size so that

$$2\sqrt{0.024/S} < 0.01, \quad \text{i.e. } S > 960.$$
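The standard-error calculation above can be sketched in R as follows. This is only an illustration for the gamma(68,45) posterior of the birthrate example: the seed, the sample size S = 1000 and the helper name required.S are assumptions made for this sketch, not part of the text.

```r
# Posterior in the birthrate example: gamma(a + sum(y), b + n) = gamma(68, 45)
a <- 2 ; b <- 1
sy <- 66 ; n <- 44

set.seed(1)                        # arbitrary seed, used only for reproducibility
S <- 1000
theta.mc <- rgamma(S, a + sy, b + n)

theta.bar <- mean(theta.mc)        # Monte Carlo estimate of E[theta | y]
mc.se <- sqrt(var(theta.mc) / S)   # Monte Carlo standard error; var() divides by S - 1
mc.ci <- theta.bar + c(-2, 2) * mc.se   # approximate 95% interval for E[theta | y]

# Hypothetical helper: smallest S (up to rounding) for which two standard
# errors fall below the desired precision, given a variance estimate
required.S <- function(var.est, precision) (2 / precision)^2 * var.est
required.S(0.024, 0.01)
```

With the variance estimate 0.024 and precision 0.01 used in the text, the helper reproduces the bound S > 960.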


4.2 Posterior inference for arbitrary functions 


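Suppose we are interested in the posterior distribution of some computable
function $g(\theta)$ of $\theta$. In the binomial model, for example, we are sometimes
interested in the log odds:

$$\text{log odds}(\theta) = \log\frac{\theta}{1-\theta} = \gamma.$$

The law of large numbers says that if we generate a sequence
$\{\theta^{(1)},\theta^{(2)},\ldots\}$ from the posterior distribution of $\theta$, then the
average value of $\log\frac{\theta^{(s)}}{1-\theta^{(s)}}$ converges to
$E[\log\frac{\theta}{1-\theta}|y_1,\ldots,y_n]$. However, we may also be interested in
other aspects of the posterior distribution of $\gamma = \log\frac{\theta}{1-\theta}$.
Fortunately, these too can be computed using a Monte Carlo approach:

$$\begin{array}{ll}
\text{sample } \theta^{(1)} \sim p(\theta|y_1,\ldots,y_n), & \text{compute } \gamma^{(1)} = g(\theta^{(1)}) \\
\text{sample } \theta^{(2)} \sim p(\theta|y_1,\ldots,y_n), & \text{compute } \gamma^{(2)} = g(\theta^{(2)}) \\
\quad\vdots & \quad\vdots \\
\text{sample } \theta^{(S)} \sim p(\theta|y_1,\ldots,y_n), & \text{compute } \gamma^{(S)} = g(\theta^{(S)})
\end{array}$$

with each sample drawn independently. The sequence
$\{\gamma^{(1)},\ldots,\gamma^{(S)}\}$ constitutes $S$ independent samples from
$p(\gamma|y_1,\ldots,y_n)$, and so as $S \to \infty$

• $\bar{\gamma} = \sum_{s=1}^{S} \gamma^{(s)}/S \to E[\gamma|y_1,\ldots,y_n]$;
• $\sum_{s=1}^{S} (\gamma^{(s)} - \bar{\gamma})^2/(S-1) \to \text{Var}[\gamma|y_1,\ldots,y_n]$;
• the empirical distribution of $\{\gamma^{(1)},\ldots,\gamma^{(S)}\} \to p(\gamma|y_1,\ldots,y_n)$;

as before.


Example: Log-odds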


Fifty-four percent of the respondents in the 1998 General Social Survey re- 
ported their religious preference as Protestant, leaving non-Protestants in the 
minority. Respondents were also asked if they agreed with a Supreme Court 
ruling that prohibited state or local governments from requiring the reading 
of religious texts in public schools. Of the n = 860 individuals in the religious 
minority (non-Protestant), y = 441 (51%) said they agreed with the Supreme 
Court ruling, whereas 353 of the 1011 Protestants (35%) agreed with the 
ruling. 

Let $\theta$ be the population proportion agreeing with the ruling in the minority
population. Using a binomial sampling model and a uniform prior distribution,
the posterior distribution of $\theta$ is beta(442, 420). Using the Monte Carlo
algorithm described above, we can obtain samples of the log-odds
$\gamma = \log[\theta/(1-\theta)]$ from both the prior distribution and the posterior
distribution of $\gamma$. In R, the Monte Carlo algorithm involves only a few
commands:


a<-1 ; b<-1
theta.prior.mc<-rbeta(10000,a,b)
gamma.prior.mc<-log( theta.prior.mc/(1-theta.prior.mc) )

n0<-860-441 ; n1<-441
theta.post.mc<-rbeta(10000,a+n1,b+n0)
gamma.post.mc<-log( theta.post.mc/(1-theta.post.mc) )


Using the density() function in R , we can plot smooth kernel density ap- 
proximations to these distributions, as shown in Figure 4.3. 
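Beyond plotting, the same Monte Carlo samples give numerical summaries of the posterior distribution of the log-odds. The sketch below is one way to do this; the seed and the particular summaries chosen are assumptions made for illustration. Because the log-odds is positive exactly when the proportion exceeds 1/2, the last Monte Carlo estimate can be checked against an exact beta probability.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
n <- 860 ; y <- 441                  # minority sample size and number agreeing
a <- 1 ; b <- 1                      # uniform prior, i.e. beta(1, 1)

theta.post.mc <- rbeta(10000, a + y, b + n - y)            # posterior is beta(442, 420)
gamma.post.mc <- log(theta.post.mc / (1 - theta.post.mc))  # log-odds samples

post.mean <- mean(gamma.post.mc)                     # estimate of E[gamma | y]
post.ci <- quantile(gamma.post.mc, c(.025, .975))    # 95% quantile-based interval
pr.pos.mc <- mean(gamma.post.mc > 0)                 # estimate of Pr(gamma > 0 | y)

# Exact value of the same probability, since gamma > 0 iff theta > 1/2
pr.pos.exact <- 1 - pbeta(0.5, a + y, b + n - y)
c(mc = pr.pos.mc, exact = pr.pos.exact)
```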


Example: Functions of two parameters 


Based on the prior distributions and the data in the birthrate example, the 
posterior distributions for the two educational groups are 


$\{\theta_1|y_{1,1},\ldots,y_{n_1,1}\} \sim$ gamma(219, 112) (women without bachelor’s degrees)
$\{\theta_2|y_{1,2},\ldots,y_{n_2,2}\} \sim$ gamma(68, 45) (women with bachelor’s degrees).


Fig. 4.3. Monte Carlo approximations to the prior and posterior distributions of
the log-odds.


There are a variety of ways to describe our knowledge about the difference
between $\theta_1$ and $\theta_2$. For example, we may be interested in the numerical value
of $\Pr(\theta_1 > \theta_2|Y_{1,1} = y_{1,1},\ldots,Y_{n_2,2} = y_{n_2,2})$, or in the posterior
distribution of $\theta_1/\theta_2$. Both of these quantities can be obtained with Monte
Carlo sampling:

$$\begin{array}{ll}
\text{sample } \theta_1^{(1)} \sim p(\theta_1|\sum_{i=1}^{111} Y_{i,1} = 217), & \text{sample } \theta_2^{(1)} \sim p(\theta_2|\sum_{i=1}^{44} Y_{i,2} = 66) \\
\text{sample } \theta_1^{(2)} \sim p(\theta_1|\sum_{i=1}^{111} Y_{i,1} = 217), & \text{sample } \theta_2^{(2)} \sim p(\theta_2|\sum_{i=1}^{44} Y_{i,2} = 66) \\
\quad\vdots & \quad\vdots \\
\text{sample } \theta_1^{(S)} \sim p(\theta_1|\sum_{i=1}^{111} Y_{i,1} = 217), & \text{sample } \theta_2^{(S)} \sim p(\theta_2|\sum_{i=1}^{44} Y_{i,2} = 66).
\end{array}$$

The sequence $\{(\theta_1^{(1)},\theta_2^{(1)}),\ldots,(\theta_1^{(S)},\theta_2^{(S)})\}$ consists of $S$
independent samples from the joint posterior distribution of $\theta_1$ and $\theta_2$,
and can be used to make Monte Carlo approximations to posterior quantities of
interest. For example, $\Pr(\theta_1 > \theta_2|\sum_{i=1}^{111} Y_{i,1} = 217, \sum_{i=1}^{44} Y_{i,2} = 66)$
is approximated by $\frac{1}{S}\sum_{s=1}^{S} 1(\theta_1^{(s)} > \theta_2^{(s)})$, where $1(x > y)$ is
the indicator function which is 1 if $x > y$ and zero otherwise. The
approximation can be calculated in R with the following commands:

> a<-2 ; b<-1
> sy1<-217 ; n1<-111
> sy2<-66 ; n2<-44

> theta1.mc<-rgamma(10000,a+sy1,b+n1)
> theta2.mc<-rgamma(10000,a+sy2,b+n2)

> mean(theta1.mc>theta2.mc)
[1] 0.9708

Additionally, if we were interested in the ratio of the means of the two groups,
we could use the empirical distribution of
$\{\theta_1^{(1)}/\theta_2^{(1)},\ldots,\theta_1^{(S)}/\theta_2^{(S)}\}$ to approximate the posterior
distribution of $\theta_1/\theta_2$. A Monte Carlo estimate of this posterior density is
given in Figure 4.4.

Fig. 4.4. Monte Carlo estimate of the posterior distribution of $\gamma = \theta_1/\theta_2$.
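
A minimal sketch of how the ratio could be summarized numerically is given here. It reuses the posterior parameters from the commands above; the seed, the variable name ratio.mc and the chosen summaries are assumptions for illustration only.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
a <- 2 ; b <- 1
sy1 <- 217 ; n1 <- 111               # women without bachelor's degrees
sy2 <- 66 ; n2 <- 44                 # women with bachelor's degrees

theta1.mc <- rgamma(10000, a + sy1, b + n1)
theta2.mc <- rgamma(10000, a + sy2, b + n2)

ratio.mc <- theta1.mc / theta2.mc    # samples from the posterior of theta1/theta2

ratio.mean <- mean(ratio.mc)                       # posterior mean of the ratio
ratio.q <- quantile(ratio.mc, c(.025, .5, .975))   # median and 95% interval
pr.gt1 <- mean(ratio.mc > 1)                       # same event as theta1 > theta2
ratio.q
```

A kernel density estimate such as plot(density(ratio.mc)) would give a picture comparable to Figure 4.4, up to Monte Carlo variation.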


4.3 Sampling from predictive distributions 


As described in Section 3.1, the predictive distribution of a random variable 
Y is a probability distribution for Y such that 


• known quantities have been conditioned on;
• unknown quantities have been integrated out.


For example, let $\tilde{Y}$ be the number of children of a person who is sampled
from the population of women aged 40 with a college degree. If we knew the true
mean birthrate $\theta$ of this population, we might describe our uncertainty about
$\tilde{Y}$ with a Poisson($\theta$) distribution:

$$\text{Sampling model:} \quad \Pr(\tilde{Y} = \tilde{y}|\theta) = p(\tilde{y}|\theta) = \theta^{\tilde{y}} e^{-\theta}/\tilde{y}!$$

We cannot make predictions from this model, however, because we do not
actually know $\theta$. If we did not have any sample data from the population, our
predictive distribution would be obtained by integrating out $\theta$:

$$\text{Predictive model:} \quad \Pr(\tilde{Y} = \tilde{y}) = \int p(\tilde{y}|\theta)\, p(\theta)\, d\theta$$

In the case where $\theta \sim$ gamma($a, b$), we showed in the last chapter that this
predictive distribution is the negative binomial($a, b$) distribution. A
predictive distribution that integrates over unknown parameters but is not
conditional on observed data is called a prior predictive distribution. Such a
distribution can be useful in evaluating if a prior distribution for $\theta$
actually translates into reasonable prior beliefs for observable data $\tilde{Y}$
(see Exercise 7.4). After we have observed a sample $Y_1,\ldots,Y_n$ from the
population, the relevant predictive distribution for a new observation becomes

$$\begin{aligned}
\Pr(\tilde{Y} = \tilde{y}|Y_1 = y_1,\ldots,Y_n = y_n) &= \int p(\tilde{y}|\theta, y_1,\ldots,y_n)\, p(\theta|y_1,\ldots,y_n)\, d\theta \\
&= \int p(\tilde{y}|\theta)\, p(\theta|y_1,\ldots,y_n)\, d\theta.
\end{aligned}$$

This is called a posterior predictive distribution, because it conditions on an
observed dataset. In the case of a Poisson model with a gamma prior
distribution, we showed in Chapter 3 that the posterior predictive distribution
is negative binomial($a + \sum y_i$, $b + n$).

In many modeling situations, we will be able to sample from
$p(\theta|y_1,\ldots,y_n)$ and $p(y|\theta)$, but $p(\tilde{y}|y_1,\ldots,y_n)$ will be too
complicated to sample from directly. In this situation we can sample from the
posterior predictive distribution indirectly using a Monte Carlo procedure.
Since $p(\tilde{y}|y_1,\ldots,y_n) = \int p(\tilde{y}|\theta)\, p(\theta|y_1,\ldots,y_n)\, d\theta$, we see
that $p(\tilde{y}|y_1,\ldots,y_n)$ is the posterior expectation of $p(\tilde{y}|\theta)$. To
obtain the posterior predictive probability that $\tilde{Y}$ is equal to some
specific value $\tilde{y}$, we could just apply the Monte Carlo method of the
previous section: Sample $\theta^{(1)},\ldots,\theta^{(S)} \sim$ i.i.d. $p(\theta|y_1,\ldots,y_n)$,
and then approximate $p(\tilde{y}|y_1,\ldots,y_n)$ with
$\sum_{s=1}^{S} p(\tilde{y}|\theta^{(s)})/S$. This procedure will work well if $p(y|\theta)$ is
discrete and we are interested in quantities that are easily computed from
$p(y|\theta)$. However, it will generally be useful to have a set of samples of
$\tilde{Y}$ from its posterior predictive distribution. Obtaining these samples can
be done quite easily as follows:

$$\begin{array}{ll}
\text{sample } \theta^{(1)} \sim p(\theta|y_1,\ldots,y_n), & \text{sample } \tilde{y}^{(1)} \sim p(\tilde{y}|\theta^{(1)}) \\
\text{sample } \theta^{(2)} \sim p(\theta|y_1,\ldots,y_n), & \text{sample } \tilde{y}^{(2)} \sim p(\tilde{y}|\theta^{(2)}) \\
\quad\vdots & \quad\vdots \\
\text{sample } \theta^{(S)} \sim p(\theta|y_1,\ldots,y_n), & \text{sample } \tilde{y}^{(S)} \sim p(\tilde{y}|\theta^{(S)}).
\end{array}$$

The sequence $\{(\theta,\tilde{y})^{(1)},\ldots,(\theta,\tilde{y})^{(S)}\}$ constitutes $S$ independent
samples from the joint posterior distribution of $(\theta,\tilde{Y})$, and the sequence
$\{\tilde{y}^{(1)},\ldots,\tilde{y}^{(S)}\}$ constitutes $S$ independent samples from the
marginal posterior distribution of $\tilde{Y}$, which is the posterior predictive
distribution.
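
The two approaches described above (averaging the sampling probabilities over posterior draws, and sampling new observations directly) can be compared in a case where the exact answer is known. The sketch below assumes the Poisson-gamma setup of the birthrate example, where the exact posterior predictive distribution is negative binomial; it uses R's size/mu parameterization of dnbinom, with mu equal to the posterior mean. The seed and the range 0 to 5 are arbitrary choices for illustration.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
a <- 2 ; b <- 1
sy <- 66 ; n <- 44
S <- 10000

theta.mc <- rgamma(S, a + sy, b + n)   # theta^(s) ~ p(theta | y_1,...,y_n)
y.mc <- rpois(S, theta.mc)             # ytilde^(s) ~ Poisson(theta^(s))

# Approach 1: average the sampling probabilities p(ytilde | theta^(s))
p.avg <- sapply(0:5, function(y) mean(dpois(y, theta.mc)))

# Approach 2: relative frequencies of the sampled ytilde values
p.freq <- sapply(0:5, function(y) mean(y.mc == y))

# Exact negative binomial probabilities for comparison
p.exact <- dnbinom(0:5, size = a + sy, mu = (a + sy) / (b + n))

round(cbind(y = 0:5, p.avg, p.freq, p.exact), 3)
```

The averaged version is typically less noisy for a single probability, while the sampled values can be reused for any function of the new observation.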


Example: Poisson model 


At the end of Chapter 3 it was reported that the predictive probability that 
an age-40 woman without a college degree would have more children than an 
age-40 woman with a degree was 0.48. To arrive at this answer exactly we 
would have to do the following doubly infinite sum: 
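
$$\Pr(\tilde{Y}_1 > \tilde{Y}_2|\textstyle\sum Y_{i,1} = 217, \sum Y_{i,2} = 66) = \sum_{\tilde{y}_2=0}^{\infty} \sum_{\tilde{y}_1=\tilde{y}_2+1}^{\infty} \text{dnbinom}(\tilde{y}_1, 219, 112) \times \text{dnbinom}(\tilde{y}_2, 68, 45).$$


Alternatively, this sum can be approximated with Monte Carlo sampling.
Since $\tilde{Y}_1$ and $\tilde{Y}_2$ are a posteriori independent, samples from their joint
posterior distribution can be made by sampling values of each variable
separately from their individual posterior distributions. Posterior predictive
samples from the conjugate Poisson model can be generated as follows:

$$\begin{array}{ll}
\text{sample } \theta^{(1)} \sim \text{gamma}(a + \sum y_i, b + n), & \text{sample } \tilde{y}^{(1)} \sim \text{Poisson}(\theta^{(1)}) \\
\text{sample } \theta^{(2)} \sim \text{gamma}(a + \sum y_i, b + n), & \text{sample } \tilde{y}^{(2)} \sim \text{Poisson}(\theta^{(2)}) \\
\quad\vdots & \quad\vdots \\
\text{sample } \theta^{(S)} \sim \text{gamma}(a + \sum y_i, b + n), & \text{sample } \tilde{y}^{(S)} \sim \text{Poisson}(\theta^{(S)}).
\end{array}$$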


Monte Carlo samples from the posterior predictive distributions of our two 
educational groups can be obtained with just a few commands in R: 


> a<-2 ; b<-1
> sy1<-217 ; n1<-111
> sy2<-66 ; n2<-44

> theta1.mc<-rgamma(10000,a+sy1,b+n1)
> theta2.mc<-rgamma(10000,a+sy2,b+n2)
> y1.mc<-rpois(10000,theta1.mc)
> y2.mc<-rpois(10000,theta2.mc)

> mean(y1.mc>y2.mc)
[1] 0.4823


Once we have generated these Monte Carlo samples from the posterior predictive
distribution, we can use them again to calculate other posterior quantities
of interest. For example, Figure 4.5 shows the Monte Carlo approximation to
the posterior distribution of $D = (\tilde{Y}_1 - \tilde{Y}_2)$, the difference in number of
children between two individuals, one sampled from each of the two groups.
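
As a rough sketch, the distribution of the difference can be tabulated directly from the predictive samples. The code regenerates the samples so that it runs on its own; the seed and the name d.mc are assumptions made for this illustration.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
a <- 2 ; b <- 1
sy1 <- 217 ; n1 <- 111
sy2 <- 66 ; n2 <- 44

theta1.mc <- rgamma(10000, a + sy1, b + n1)
theta2.mc <- rgamma(10000, a + sy2, b + n2)
y1.mc <- rpois(10000, theta1.mc)     # predictive draws, women without a degree
y2.mc <- rpois(10000, theta2.mc)     # predictive draws, women with a degree

d.mc <- y1.mc - y2.mc                # samples of D = Ytilde1 - Ytilde2

d.dist <- table(d.mc) / length(d.mc) # Monte Carlo approximation to p(D = d | data)
d.mean <- mean(d.mc)                 # compare with 219/112 - 68/45
pr.split <- c(less = mean(d.mc < 0), equal = mean(d.mc == 0),
              greater = mean(d.mc > 0))
pr.split
```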


4.4 Posterior predictive model checking 


Let’s consider for the moment the sample of 40-year-old women without a
college degree. The empirical distribution of the number of children of these
women, along with the corresponding posterior predictive distribution, is
shown in the first panel of Figure 4.6. In this sample of $n = 111$ women,
the number of women with exactly two children is 38, which is twice the
number of women in the sample with one child. In contrast, this group’s
posterior predictive distribution, shown in gray, suggests that sampling a
woman with two children is slightly less probable than sampling a woman with
one (probabilities of 0.27 and 0.28, respectively). These two distributions
seem to be in conflict. If the observed data have twice as many women with two
children as with one, why should we be predicting otherwise?

Fig. 4.5. The posterior predictive distribution of $D = \tilde{Y}_1 - \tilde{Y}_2$, the
difference in the number of children of two randomly sampled women, one from
each of the two educational populations.

Fig. 4.6. Evaluation of model fit. The first panel shows the empirical and posterior
predictive distributions of the number of children of women without a bachelor’s
degree. The second panel shows the posterior predictive distribution of the empirical
odds of having two children versus one child in a dataset of size $n = 111$. The
observed odds are given in the short vertical line.

One explanation for the large number of women in the sample with two
children is that it is a result of sampling variability: The empirical distribution 
of sampled data does not generally match exactly the distribution of the 
population from which the data were sampled, and in fact may look quite 
different if the sample size is small. A smooth population distribution can 
produce sample empirical distributions that are quite bumpy. In such cases, 
having a predictive distribution that smoothes over the bumps of the empirical 
distribution may be desirable. 

An alternative explanation for the large number of women in the sample 
with two children is that this is indeed a feature of the population, and the 
data are correctly reflecting this feature. In contrast, the Poisson model is 
unable to represent this feature of the population because there is no Poisson 
distribution that has such a sharp peak at y = 2. 

These explanations for the discrepancy between the empirical and predictive
distributions can be assessed numerically with Monte Carlo simulation.
For every vector $\mathbf{y}$ of length $n = 111$, let $t(\mathbf{y})$ be the ratio of the
number of 2’s in $\mathbf{y}$ to the number of 1’s, so for our observed data
$\mathbf{y}_{obs}$, $t(\mathbf{y}_{obs}) = 2$. Now suppose we were to sample a different set of
111 women, obtaining a data vector $\tilde{\mathbf{Y}}$ of length 111 recording their
number of children. What sort of values of $t(\tilde{\mathbf{Y}})$ would we expect? Monte
Carlo samples from the posterior predictive distribution of $t(\tilde{\mathbf{Y}})$ can be
obtained with the following procedure and R-code:


For each $s \in \{1,\ldots,S\}$,
1. sample $\theta^{(s)} \sim p(\theta|\mathbf{Y} = \mathbf{y}_{obs})$
2. sample $\tilde{\mathbf{Y}}^{(s)} = (\tilde{y}_1^{(s)},\ldots,\tilde{y}_n^{(s)}) \sim$ i.i.d. $p(y|\theta^{(s)})$
3. compute $t^{(s)} = t(\tilde{\mathbf{Y}}^{(s)})$.

a<-2 ; b<-1
t.mc<-NULL

for(s in 1:10000) {
  theta1<-rgamma(1,a+sy1,b+n1)
  y1.mc<-rpois(n1,theta1)
  t.mc<-c(t.mc,sum(y1.mc==2)/sum(y1.mc==1))
}


In this Monte Carlo sampling scheme,
$\{\theta^{(1)},\ldots,\theta^{(S)}\}$ are samples from the posterior distribution of $\theta$;
$\{\tilde{\mathbf{Y}}^{(1)},\ldots,\tilde{\mathbf{Y}}^{(S)}\}$ are posterior predictive datasets, each of size $n$;
$\{t^{(1)},\ldots,t^{(S)}\}$ are samples from the posterior predictive distribution of $t(\tilde{\mathbf{Y}})$.


A Monte Carlo approximation to the distribution of $t(\tilde{\mathbf{Y}})$ is shown in the
second panel of Figure 4.6, with the observed value $t(\mathbf{y}_{obs})$ indicated with a
short vertical line. Out of 10,000 Monte Carlo datasets, only about a half of
a percent had values of $t(\mathbf{y})$ that equaled or exceeded $t(\mathbf{y}_{obs})$. This
indicates that our Poisson model is flawed: It predicts that we would hardly
ever see a dataset that resembled our observed one in terms of $t(\mathbf{y})$. If we
were interested in making inference on the true probability distribution
$p_{true}(y)$ for each value of $y$, then the Poisson model would be inadequate, and
we would have to consider a more complicated model (for example, a multinomial
sampling model). However, a simple Poisson model may suffice if we are
interested only in certain aspects of $p_{true}$. For example, the predictive
distribution generated by the Poisson model will have a mean that approximates
the true population mean, even though $p_{true}$ may not be a Poisson distribution.
Additionally, for these data the sample mean and variance are similar, being
1.95 and 1.90 respectively, suggesting that the Poisson model can represent
both the mean and variance of the population.
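
The proportion of simulated datasets whose statistic equals or exceeds the observed value can be computed as sketched below. This is a self-contained variant of the loop above, with two changes that are choices of this sketch rather than of the text: the result vector is preallocated instead of grown with c(), and a seed is fixed. The name ppp is hypothetical.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
a <- 2 ; b <- 1
sy1 <- 217 ; n1 <- 111               # women without bachelor's degrees
S <- 10000

t.mc <- numeric(S)                   # preallocated storage for t^(s)
for (s in 1:S) {
  theta1 <- rgamma(1, a + sy1, b + n1)   # theta^(s) from the posterior
  y1.mc <- rpois(n1, theta1)             # one predictive dataset of size n1
  t.mc[s] <- sum(y1.mc == 2) / sum(y1.mc == 1)
}

t.obs <- 2                           # observed ratio of 2's to 1's
ppp <- mean(t.mc >= t.obs)           # share of datasets with t at least as large
ppp
```

The value should be small, in line with the "about a half of a percent" reported in the text, although the exact figure depends on the seed.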

In terms of data description, we should at least make sure that our model 
generates predictive datasets $\tilde{\mathbf{Y}}$ that resemble the observed dataset in terms
of features that are of interest. If this condition is not met, we may want to 
consider using a more complex model. However, an incorrect model can still 
provide correct inference for some aspects of the true population (White, 1982; 
Bunke and Milhaud, 1998; Kleijn and van der Vaart, 2006). For example, the 
Poisson model provides consistent estimation of the population mean, as well 
as accurate confidence intervals if the population mean is approximately equal 
to the variance. 


4.5 Discussion and further references 


The use of Monte Carlo methods is widespread in statistics and science in 
general. Rubinstein and Kroese (2008) cover Monte Carlo methods for a wide 
variety of statistical problems, and Robert and Casella (2004) include more 
coverage of Bayesian applications (and cover Markov chain Monte Carlo meth- 
ods as well). 

Using the posterior predictive distribution to assess model fit was sug- 
gested by Guttman (1967) and Rubin (1984), and is now common practice. 
In some problems, it is useful to evaluate goodness-of-fit using functions that 
depend on parameters as well as predicted data. This is discussed in Gelman 
et al (1996) and more recently in Johnson (2007). These types of posterior 
predictive checks have given rise to a notion of posterior predictive p-values, 
which despite their name, do not generally share the same frequentist prop- 
erties as p-values based on classical goodness-of-fit tests. This distinction is 
discussed in Bayarri and Berger (2000), who also consider alternative types of 
Bayesian goodness of fit probabilities to serve as a replacement for frequentist 
p-values. 


5 


The normal model 


Perhaps the most useful (or utilized) probability model for data analysis is the 
normal distribution. There are several reasons for this, one being the central 
limit theorem, and another being that the normal model is a simple model 
with separate parameters for the population mean and variance - two quan- 
tities that are often of primary interest. In this chapter we discuss some of 
the properties of the normal distribution, and show how to make posterior 
inference on the population mean and variance parameters. We also compare 
the sampling properties of the standard Bayesian estimator of the population 
mean to those of the unbiased sample mean. Lastly, we discuss the appro- 
priateness of the normal model when the underlying data are not normally 
distributed. 


5.1 The normal model 


A random variable $Y$ is said to be normally distributed with mean $\theta$ and
variance $\sigma^2 > 0$ if the density of $Y$ is given by

$$p(y|\theta,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{y-\theta}{\sigma}\right)^2}, \quad -\infty < y < \infty.$$

Figure 5.1 shows normal density curves for a few values of $\theta$ and $\sigma^2$. Some
important things to remember about this distribution include that

• the distribution is symmetric about $\theta$, and the mode, median and mean
are all equal to $\theta$;

• about 95% of the population lies within two standard deviations of the
mean (more precisely, 1.96 standard deviations);

• if $X \sim$ normal($\mu, \tau^2$), $Y \sim$ normal($\theta, \sigma^2$) and $X$ and $Y$ are
independent, then $aX + bY \sim$ normal($a\mu + b\theta$, $a^2\tau^2 + b^2\sigma^2$);

• the dnorm, rnorm, pnorm, and qnorm commands in R take the standard
deviation $\sigma$ as their argument, not the variance $\sigma^2$. Be very careful about
this when using R: confusing $\sigma$ with $\sigma^2$ can drastically change your results.


P.D. Hoff, A First Course in Bayesian Statistical Methods, 
Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_5, 
© Springer Science+Business Media, LLC 2009 
> dnorm
function (x, mean = 0, sd = 1, log = FALSE) 
.Internal(dnorm(x, mean, sd, log)) 
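
The warning about standard deviation versus variance can be illustrated with a short simulation. The values of theta and sigma2 below match one of the curves in Figure 5.1; the seed, sample size and variable names are assumptions made for this sketch.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
theta <- 5 ; sigma2 <- 4             # mean and VARIANCE of the intended model

y.right <- rnorm(100000, mean = theta, sd = sqrt(sigma2))  # correct: pass sigma
y.wrong <- rnorm(100000, mean = theta, sd = sigma2)        # mistake: passes sigma^2

var(y.right)                         # close to sigma2 = 4
var(y.wrong)                         # close to sigma2^2 = 16

# Probability within 1.96 standard deviations of the mean, using pnorm
sigma <- sqrt(sigma2)
cover <- pnorm(theta + 1.96 * sigma, theta, sigma) -
         pnorm(theta - 1.96 * sigma, theta, sigma)
cover                                # approximately 0.95
```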


Fig. 5.1. Some normal densities ($\theta = 2, \sigma^2 = 0.25$; $\theta = 5, \sigma^2 = 4$;
$\theta = 7, \sigma^2 = 1$).


The importance of the normal distribution stems primarily from the cen- 
tral limit theorem, which says that under very general conditions, the sum (or 
mean) of a set of random variables is approximately normally distributed. In 
practice, this means that the normal sampling model will be appropriate for 
data that result from the additive effects of a large number of factors. 


Example: women’s height 


A study of 1,100 English families from 1893 to 1898 gathered height data
on $n = 1375$ women over the age of 18. A histogram of these data is shown
in Figure 5.2. The sample mean of these data is $\bar{y} = 63.75$ and the sample
standard deviation is $s = 2.62$ inches. One explanation for the variability in
heights among these women is that the women were heterogeneous in terms of
a number of factors controlling human growth, such as genetics, diet, disease,
stress and so on. Variability in these factors among the women results in
variability in their heights. Letting $y_i$ be the height in inches of woman $i$, a
simple additive model for height might be

$$\begin{aligned}
y_1 &= a + b \times \text{gene}_1 + c \times \text{diet}_1 + d \times \text{disease}_1 + \cdots \\
y_2 &= a + b \times \text{gene}_2 + c \times \text{diet}_2 + d \times \text{disease}_2 + \cdots \\
&\;\;\vdots \\
y_n &= a + b \times \text{gene}_n + c \times \text{diet}_n + d \times \text{disease}_n + \cdots
\end{aligned}$$

where $\text{gene}_i$ might denote the presence of a particular height-promoting gene,
$\text{diet}_i$ might measure some aspect of woman $i$’s diet, and $\text{disease}_i$ might
indicate if woman $i$ had ever had a particular disease. Of course, there may be
a large number of genes, diseases, dietary and other factors that contribute to
a woman’s height. If the effects of these factors are approximately additive,
then each height measurement $y_i$ is roughly equal to a linear combination of a
large number of terms. For such situations, the central limit theorem says that
the empirical distribution of $y_1,\ldots,y_n$ will look like a normal distribution,
and so the normal model provides an appropriate sampling model for the data.


Fig. 5.2. Height data (in inches) and a normal density with $\theta = 63.75$ and
$\sigma = 2.62$.


5.2 Inference for the mean, conditional on the variance 


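Suppose our model is $\{Y_1,\ldots,Y_n|\theta,\sigma^2\} \sim$ i.i.d. normal($\theta,\sigma^2$). Then the
joint sampling density is given by

$$\begin{aligned}
p(y_1,\ldots,y_n|\theta,\sigma^2) &= \prod_{i=1}^{n} p(y_i|\theta,\sigma^2) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{y_i-\theta}{\sigma}\right)^2} \\
&= (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{y_i-\theta}{\sigma}\right)^2\right\}.
\end{aligned}$$

Expanding the quadratic term in the exponent, we see that
$p(y_1,\ldots,y_n|\theta,\sigma^2)$ depends on $y_1,\ldots,y_n$ through

$$\sum_{i=1}^{n}\left(\frac{y_i-\theta}{\sigma}\right)^2 = \frac{1}{\sigma^2}\sum y_i^2 - 2\frac{\theta}{\sigma^2}\sum y_i + n\frac{\theta^2}{\sigma^2}.$$

From this you can show that $\{\sum y_i^2, \sum y_i\}$ make up a two-dimensional
sufficient statistic. Knowing the values of these quantities is equivalent to
knowing the values of $\bar{y} = \sum y_i/n$ and $s^2 = \sum (y_i - \bar{y})^2/(n-1)$, and so
$\{\bar{y}, s^2\}$ are also a sufficient statistic.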

Inference for this two-parameter model can be broken down into two
one-parameter problems. We will begin with the problem of making inference for
$\theta$ when $\sigma^2$ is known, and use a conjugate prior distribution for $\theta$. For any
(conditional) prior distribution $p(\theta|\sigma^2)$, the posterior distribution will
satisfy

$$\begin{aligned}
p(\theta|y_1,\ldots,y_n,\sigma^2) &\propto p(\theta|\sigma^2) \times e^{-\frac{1}{2\sigma^2}\sum (y_i-\theta)^2} \\
&\propto p(\theta|\sigma^2) \times e^{c_1(\theta-c_2)^2}.
\end{aligned}$$

Recall that a class of prior distributions is conjugate for a sampling model if
the resulting posterior distribution is in the same class. From the calculation
above, we see that if $p(\theta|\sigma^2)$ is to be conjugate, it must include quadratic
terms like $e^{c_1(\theta-c_2)^2}$. The simplest such class of probability densities on
$\mathbb{R}$ is the normal family of densities, suggesting that if $p(\theta|\sigma^2)$ is normal
and $y_1,\ldots,y_n$ are i.i.d. normal($\theta,\sigma^2$), then $p(\theta|y_1,\ldots,y_n,\sigma^2)$ is also
a normal density. Let’s evaluate this claim: If $\theta \sim$ normal($\mu_0,\tau_0^2$), then

$$\begin{aligned}
p(\theta|y_1,\ldots,y_n,\sigma^2) &= p(\theta|\sigma^2)\, p(y_1,\ldots,y_n|\theta,\sigma^2)/p(y_1,\ldots,y_n|\sigma^2) \\
&\propto p(\theta|\sigma^2)\, p(y_1,\ldots,y_n|\theta,\sigma^2) \\
&\propto \exp\left\{-\frac{1}{2\tau_0^2}(\theta-\mu_0)^2\right\} \exp\left\{-\frac{1}{2\sigma^2}\sum (y_i-\theta)^2\right\}.
\end{aligned}$$
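Adding the terms in the exponents and ignoring the $-1/2$ for the moment, we
have

$$\frac{1}{\tau_0^2}(\theta^2 - 2\theta\mu_0 + \mu_0^2) + \frac{1}{\sigma^2}\left(\sum y_i^2 - 2\theta\sum y_i + n\theta^2\right) = a\theta^2 - 2b\theta + c, \text{ where}$$

$$a = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \quad b = \frac{\mu_0}{\tau_0^2} + \frac{\sum y_i}{\sigma^2}, \quad \text{and } c = c(\mu_0,\tau_0^2,\sigma^2,y_1,\ldots,y_n).$$

Now let’s see if $p(\theta|\sigma^2,y_1,\ldots,y_n)$ takes the form of a normal density:

$$\begin{aligned}
p(\theta|\sigma^2,y_1,\ldots,y_n) &\propto \exp\left\{-\frac{1}{2}(a\theta^2 - 2b\theta)\right\} \\
&= \exp\left\{-\frac{1}{2}a(\theta^2 - 2b\theta/a + b^2/a^2) + \frac{1}{2}b^2/a\right\} \\
&\propto \exp\left\{-\frac{1}{2}a(\theta - b/a)^2\right\} \\
&= \exp\left\{-\frac{1}{2}\left(\frac{\theta - b/a}{1/\sqrt{a}}\right)^2\right\}.
\end{aligned}$$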

This function has exactly the same shape as a normal density curve, with
$1/\sqrt{a}$ playing the role of the standard deviation and $b/a$ playing the role
of the mean. Since probability distributions are determined by their shape,
this means that $p(\theta|\sigma^2,y_1,\ldots,y_n)$ is indeed a normal density. We refer to
the mean and variance of this density as $\mu_n$ and $\tau_n^2$, where

$$\tau_n^2 = \frac{1}{a} = \frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} \quad \text{and} \quad \mu_n = \frac{b}{a} = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}.$$


Combining information 


The (conditional) posterior parameters $\tau_n^2$ and $\mu_n$ combine the prior
parameters $\tau_0^2$ and $\mu_0$ with terms from the data.


• Posterior variance and precision: The formula for $1/\tau_n^2$ is

$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad (5.1)$$

and so the prior inverse variance is combined with the inverse of the data
variance. Inverse variance is often referred to as the precision. For the
normal model let

$\tilde{\sigma}^2 = 1/\sigma^2$ = sampling precision, i.e. how close the $y_i$’s are to $\theta$;

$\tilde{\tau}_0^2 = 1/\tau_0^2$ = prior precision;

$\tilde{\tau}_n^2 = 1/\tau_n^2$ = posterior precision.

It is convenient to think about precision as the quantity of information on
an additive scale. For the normal model, Equation 5.1 implies that

$$\tilde{\tau}_n^2 = \tilde{\tau}_0^2 + n\tilde{\sigma}^2,$$

and so posterior information = prior information + data information.

• Posterior mean: Notice that

$$\mu_n = \frac{\tilde{\tau}_0^2}{\tilde{\tau}_0^2 + n\tilde{\sigma}^2}\mu_0 + \frac{n\tilde{\sigma}^2}{\tilde{\tau}_0^2 + n\tilde{\sigma}^2}\bar{y},$$

and so the posterior mean is a weighted average of the prior mean and
the sample mean. The weight on the sample mean is $n/\sigma^2$, the sampling
precision of the sample mean. The weight on the prior mean is $1/\tau_0^2$, the
prior precision. If the prior mean were based on $\kappa_0$ prior observations from
the same (or similar) population as $Y_1,\ldots,Y_n$, then we might want to set
$\tau_0^2 = \sigma^2/\kappa_0$, the variance of the mean of the prior observations. In this
case, the formula for the posterior mean reduces to

$$\mu_n = \frac{\kappa_0}{\kappa_0 + n}\mu_0 + \frac{n}{\kappa_0 + n}\bar{y}.$$

Prediction


Consider predicting a new observation $\tilde{Y}$ from the population after having
observed $(Y_1 = y_1,\ldots,Y_n = y_n)$. To find the predictive distribution, let’s
use the following fact:

$$\{\tilde{Y}|\theta,\sigma^2\} \sim \text{normal}(\theta,\sigma^2) \iff \tilde{Y} = \theta + \tilde{\epsilon}, \quad \{\tilde{\epsilon}|\theta,\sigma^2\} \sim \text{normal}(0,\sigma^2).$$

In other words, saying that $\tilde{Y}$ is normal with mean $\theta$ is the same as saying
$\tilde{Y}$ is equal to $\theta$ plus some mean-zero normally distributed noise. Using this
result, let’s first compute the posterior mean and variance of $\tilde{Y}$:

$$\begin{aligned}
E[\tilde{Y}|y_1,\ldots,y_n,\sigma^2] &= E[\theta + \tilde{\epsilon}|y_1,\ldots,y_n,\sigma^2] \\
&= E[\theta|y_1,\ldots,y_n,\sigma^2] + E[\tilde{\epsilon}|y_1,\ldots,y_n,\sigma^2] \\
&= \mu_n + 0 = \mu_n \\
\text{Var}[\tilde{Y}|y_1,\ldots,y_n,\sigma^2] &= \text{Var}[\theta + \tilde{\epsilon}|y_1,\ldots,y_n,\sigma^2] \\
&= \text{Var}[\theta|y_1,\ldots,y_n,\sigma^2] + \text{Var}[\tilde{\epsilon}|y_1,\ldots,y_n,\sigma^2] \\
&= \tau_n^2 + \sigma^2.
\end{aligned}$$

Recall from the beginning of the chapter that the sum of independent normal
random variables is also normal. Therefore, since both $\theta$ and $\tilde{\epsilon}$,
conditional on $y_1,\ldots,y_n$ and $\sigma^2$, are normally distributed, so is
$\tilde{Y} = \theta + \tilde{\epsilon}$. The predictive distribution is therefore

$$\tilde{Y}|\sigma^2,y_1,\ldots,y_n \sim \text{normal}(\mu_n, \tau_n^2 + \sigma^2).$$

It is worthwhile to have some intuition about the form of the variance of
$\tilde{Y}$: In general, our uncertainty about a new sample $\tilde{Y}$ is a function of our
uncertainty about the center of the population ($\tau_n^2$) as well as how variable
the population is ($\sigma^2$). As $n \to \infty$ we become more and more certain about
where $\theta$ is, and the posterior variance $\tau_n^2$ of $\theta$ goes to zero. But
certainty about $\theta$ does not reduce the sampling variability $\sigma^2$, and so our
uncertainty about $\tilde{Y}$ never goes below $\sigma^2$.
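
The variance decomposition above can be checked by simulation. In the sketch below the posterior parameters mu.n and tau2.n and the sampling variance sigma2 are hypothetical values chosen only for illustration; they do not come from the text. The seed is arbitrary as well.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
mu.n <- 10 ; tau2.n <- 0.5           # hypothetical posterior mean and variance of theta
sigma2 <- 2                          # hypothetical (known) sampling variance
S <- 100000

theta.mc <- rnorm(S, mu.n, sqrt(tau2.n))        # theta^(s) ~ normal(mu_n, tau_n^2)
ytilde.mc <- rnorm(S, theta.mc, sqrt(sigma2))   # ytilde^(s) ~ normal(theta^(s), sigma^2)

pred.mean <- mean(ytilde.mc)         # compare with mu.n
pred.var <- var(ytilde.mc)           # compare with tau2.n + sigma2
c(pred.mean, pred.var)

# Quantiles of the simulated values versus the normal(mu_n, tau_n^2 + sigma^2) quantiles
q.mc <- quantile(ytilde.mc, c(.025, .975))
q.exact <- qnorm(c(.025, .975), mu.n, sqrt(tau2.n + sigma2))
rbind(q.mc, q.exact)
```

Note the sqrt() calls: rnorm and qnorm take standard deviations, not variances.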


Example: Midge wing length 


Grogan and Wirth (1981) provide data on the wing length in millimeters of
nine members of a species of midge (small, two-winged flies). From these nine
measurements we wish to make inference on the population mean $\theta$. Studies
from other populations suggest that wing lengths are typically around 1.9
mm, and so we set $\mu_0 = 1.9$. We also know that lengths must be positive,
implying that $\theta > 0$. Therefore, ideally we would use a prior distribution for
$\theta$ that has mass only on $\theta > 0$. We can approximate this restriction with a
normal prior distribution for $\theta$ as follows: Since for any normal distribution
most of the probability is within two standard deviations of the mean, we
choose $\tau_0^2$ so that $\mu_0 - 2 \times \tau_0 > 0$, or equivalently $\tau_0 < 1.9/2 = 0.95$. For
now, we take $\tau_0 = 0.95$, which somewhat overstates our prior uncertainty about
$\theta$.

The observations in order of increasing magnitude are (1.64, 1.70, 1.72,
1.74, 1.82, 1.82, 1.82, 1.90, 2.08), giving $\bar{y} = 1.804$. Using the formulae above
for $\mu_n$ and $\tau_n^2$, we have $\{\theta|y_1,\ldots,y_9,\sigma^2\} \sim$ normal($\mu_n,\tau_n^2$), where

$$\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} = \frac{1.11 \times 1.9 + \frac{9}{\sigma^2} 1.804}{1.11 + \frac{9}{\sigma^2}}$$

$$\tau_n^2 = \frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} = \frac{1}{1.11 + \frac{9}{\sigma^2}}.$$

If $\sigma^2 = s^2 = 0.017$, then $\{\theta|y_1,\ldots,y_9,\sigma^2 = 0.017\} \sim$ normal(1.805, 0.002).
A 95% quantile-based confidence interval for $\theta$ based on this distribution is
(1.72, 1.89). However, this interval assumes that we are certain that
$\sigma^2 = s^2$, when in fact $s^2$ is only a rough estimate of $\sigma^2$ based on only nine
observations. To get a more accurate representation of our information we need
to account for the fact that $\sigma^2$ is not known.

Fig. 5.3. Prior and conditional posterior distributions for the population mean wing
length in the midge example.
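
The numbers in the midge example can be reproduced with a few lines of R. This is a sketch that plugs the sample variance in for the unknown sampling variance, as the text does; the variable names are assumptions of this sketch.

```r
# Wing lengths (mm) of the nine midges, as listed in the text
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
n <- length(y) ; ybar <- mean(y) ; s2 <- var(y)

mu0 <- 1.9 ; tau20 <- 0.95^2         # prior mean and prior variance of theta
sigma2 <- s2                         # plug-in value for the sampling variance

# Posterior precision is prior precision plus data precision (Equation 5.1)
tau2n <- 1 / (1 / tau20 + n / sigma2)
# Posterior mean is the precision-weighted average of mu0 and ybar
mun <- (mu0 / tau20 + n * ybar / sigma2) * tau2n

c(mun = mun, tau2n = tau2n)

# 95% quantile-based interval; qnorm takes the standard deviation
ci <- qnorm(c(.025, .975), mun, sqrt(tau2n))
round(ci, 2)
```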


5.3 Joint inference for the mean and variance 


Bayesian inference for two or more unknown parameters is not conceptually
different from the one-parameter case. For any joint prior distribution $p(\theta, \sigma^2)$
for $\theta$ and $\sigma^2$, posterior inference proceeds using Bayes' rule:

$$p(\theta,\sigma^2|y_1,\ldots,y_n) = p(y_1,\ldots,y_n|\theta,\sigma^2)\,p(\theta,\sigma^2)/p(y_1,\ldots,y_n).$$


As before, we will begin by developing a simple conjugate class of prior dis- 
tributions which make posterior calculations easy. 

Recall from our axioms of probability that a joint distribution for two
quantities can be expressed as the product of a conditional probability and a
marginal probability:

$$p(\theta,\sigma^2) = p(\theta|\sigma^2)\,p(\sigma^2).$$

In the last section, we saw that if $\sigma^2$ were known, then a conjugate prior
distribution for $\theta$ was normal$(\mu_0, \tau_0^2)$. Let's consider the particular case in
which $\tau_0^2 = \sigma^2/\kappa_0$:

$$p(\theta,\sigma^2) = p(\theta|\sigma^2)\,p(\sigma^2) = \text{dnorm}(\theta, \mu_0, \tau_0 = \sigma/\sqrt{\kappa_0}) \times p(\sigma^2).$$

In this case, the parameters $\mu_0$ and $\kappa_0$ can be interpreted as the mean and
sample size from a set of prior observations.

For $\sigma^2$ we need a family of prior distributions that has support on $(0, \infty)$.
One such family of distributions is the gamma family, as we used for the
Poisson sampling model. Unfortunately, this family is not conjugate for the
normal variance. However, the gamma family does turn out to be a conjugate
class of densities for $1/\sigma^2$ (the precision). When using such a prior distribution
we say that $\sigma^2$ has an inverse-gamma distribution:

$$\begin{aligned}
\text{precision} = 1/\sigma^2 &\sim \text{gamma}(a, b) \\
\text{variance} = \sigma^2 &\sim \text{inverse-gamma}(a, b)
\end{aligned}$$

For interpretability later on, instead of using $a$ and $b$ we will parameterize this
prior distribution as

$$1/\sigma^2 \sim \text{gamma}\left(\frac{\nu_0}{2}, \frac{\nu_0}{2}\sigma_0^2\right).$$

Under this parameterization,

• $\text{E}[\sigma^2] = \sigma_0^2\,\frac{\nu_0/2}{\nu_0/2 - 1}$;
• $\text{mode}[\sigma^2] = \sigma_0^2\,\frac{\nu_0/2}{\nu_0/2 + 1}$, so $\text{mode}[\sigma^2] < \sigma_0^2 < \text{E}[\sigma^2]$;
• $\text{Var}[\sigma^2]$ is decreasing in $\nu_0$.

As we will see in a moment, we can interpret the prior parameters $(\sigma_0^2, \nu_0)$ as
the sample variance and sample size of prior observations.
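The mean and mode formulas above can be checked by simulation. This is a minimal sketch under assumed values $\nu_0 = 10$ and $\sigma_0^2 = 0.01$ (hypothetical, chosen so that the prior mean and variance both exist); the variable names are illustrative.

```r
set.seed(1)
nu0 <- 10; s20 <- 0.01            # hypothetical prior sample size and prior variance
a <- nu0/2; b <- nu0*s20/2        # gamma shape and rate for the precision

# If 1/sigma^2 ~ gamma(a, b), then sigma^2 = 1/precision ~ inverse-gamma(a, b)
s2.sample <- 1/rgamma(100000, a, b)

mc.mean   <- mean(s2.sample)                # Monte Carlo estimate of E[sigma^2]
form.mean <- s20 * (nu0/2) / (nu0/2 - 1)    # formula for E[sigma^2]
form.mode <- s20 * (nu0/2) / (nu0/2 + 1)    # formula for mode[sigma^2]
c(mc.mean = mc.mean, form.mean = form.mean, form.mode = form.mode)
```

The ordering mode $< \sigma_0^2 <$ mean reflects the right skew of the inverse-gamma distribution.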


Posterior inference 
Suppose our prior distributions and sampling model are as follows:

$$\begin{aligned}
1/\sigma^2 &\sim \text{gamma}(\nu_0/2, \nu_0\sigma_0^2/2) \\
\theta|\sigma^2 &\sim \text{normal}(\mu_0, \sigma^2/\kappa_0) \\
Y_1,\ldots,Y_n|\theta,\sigma^2 &\sim \text{i.i.d. normal}(\theta, \sigma^2).
\end{aligned}$$

Just as the prior distribution for $\theta$ and $\sigma^2$ can be decomposed as $p(\theta,\sigma^2) =
p(\theta|\sigma^2)p(\sigma^2)$, the posterior distribution can be similarly decomposed:

$$p(\theta,\sigma^2|y_1,\ldots,y_n) = p(\theta|\sigma^2,y_1,\ldots,y_n)\,p(\sigma^2|y_1,\ldots,y_n).$$


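The conditional distribution of $\theta$ given the data and $\sigma^2$ can be obtained using
the results of the previous section: Plugging in $\sigma^2/\kappa_0$ for $\tau_0^2$, we have

$$\{\theta|y_1,\ldots,y_n,\sigma^2\} \sim \text{normal}(\mu_n, \sigma^2/\kappa_n), \text{ where}$$

$$\kappa_n = \kappa_0 + n \quad\text{and}\quad \mu_n = \frac{(\kappa_0/\sigma^2)\mu_0 + (n/\sigma^2)\bar{y}}{\kappa_0/\sigma^2 + n/\sigma^2} = \frac{\kappa_0\mu_0 + n\bar{y}}{\kappa_n}.$$

Therefore, if $\mu_0$ is the mean of $\kappa_0$ prior observations, then $\text{E}[\theta|y_1,\ldots,y_n,\sigma^2]$ is
the sample mean of the current and prior observations, and $\text{Var}[\theta|y_1,\ldots,y_n,\sigma^2]$
is $\sigma^2$ divided by the total number of observations, both prior and current.
The posterior distribution of $\sigma^2$ can be obtained by performing an
integration over the unknown value of $\theta$:

$$\begin{aligned}
p(\sigma^2|y_1,\ldots,y_n) &\propto p(\sigma^2)\,p(y_1,\ldots,y_n|\sigma^2) \\
&= p(\sigma^2)\int p(y_1,\ldots,y_n|\theta,\sigma^2)\,p(\theta|\sigma^2)\,d\theta.
\end{aligned}$$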


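This integral can be done without much knowledge of calculus, but it is
somewhat tedious and is left as an exercise (Exercise 5.3). The result is that

$$\{1/\sigma^2|y_1,\ldots,y_n\} \sim \text{gamma}(\nu_n/2, \nu_n\sigma_n^2/2), \text{ where}$$

$$\begin{aligned}
\nu_n &= \nu_0 + n \\
\sigma_n^2 &= \frac{1}{\nu_n}\left[\nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_n}(\bar{y} - \mu_0)^2\right].
\end{aligned}$$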


These formulae suggest an interpretation of $\nu_0$ as a prior sample size, from
which a prior sample variance of $\sigma_0^2$ has been obtained. Recall that $s^2 =
\sum_{i=1}^n (y_i - \bar{y})^2/(n-1)$ is the sample variance, and $(n-1)s^2$ is the sum of
squared deviations of the observations from the sample mean, which is often
called the "sum of squares." Similarly, we can think of $\nu_0\sigma_0^2$ and $\nu_n\sigma_n^2$ as prior
and posterior sums of squares, respectively. Multiplying both sides of the last
equation by $\nu_n$ almost gives us "posterior sum of squares equals prior sum of
squares plus data sum of squares." However, the third term in the last equation
is a bit harder to understand: it says that a large value of $(\bar{y} - \mu_0)^2$ increases
the posterior probability of a large $\sigma^2$. This makes sense for our particular joint
prior distribution for $\theta$ and $\sigma^2$: If we want to think of $\mu_0$ as the sample mean of
$\kappa_0$ prior observations with variance $\sigma^2$, then $\frac{\kappa_0 n}{\kappa_0 + n}(\bar{y} - \mu_0)^2$ is an estimate of $\sigma^2$
and so we want to use the information that this term provides. For situations
in which $\mu_0$ should not be thought of as the mean of prior observations, we
will develop an alternative prior distribution in the next chapter.

Example

Returning to the midge data, studies of other populations suggest that the true
mean and standard deviation of our population under study should not be too
far from 1.9 mm and 0.1 mm respectively, suggesting $\mu_0 = 1.9$ and $\sigma_0^2 = 0.01$.
However, this population may be different from the others in terms of wing
length, and so we choose $\kappa_0 = \nu_0 = 1$ so that our prior distributions are only
weakly centered around these estimates from other populations.

The sample mean and variance of our observed data are $\bar{y} = 1.804$ and
$s^2 = 0.0169$ ($s = 0.130$). From these values and the prior parameters, we
compute $\mu_n$ and $\sigma_n^2$:

$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{y}}{\kappa_n} = \frac{1.9 + 9 \times 1.804}{1 + 9} = 1.814$$

$$\sigma_n^2 = \frac{1}{\nu_n}\left[\nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_n}(\bar{y} - \mu_0)^2\right] = \frac{0.010 + 0.135 + 0.008}{10} = 0.015.$$


These calculations can be done with the following commands in R: 


# prior
mu0<-1.9 ; k0<-1
s20<-.010 ; nu0<-1

# data
y<-c(1.64,1.70,1.72,1.74,1.82,1.82,1.82,1.90,2.08)
n<-length(y) ; ybar<-mean(y) ; s2<-var(y)

# posterior inference
kn<-k0+n ; nun<-nu0+n
mun<- (k0*mu0 + n*ybar)/kn
s2n<- (nu0*s20 +(n-1)*s2 +k0*n*(ybar-mu0)^2/(kn))/(nun)

> mun
[1] 1.814
> s2n
[1] 0.015324
> sqrt(s2n)
[1] 0.1237901

Our joint posterior distribution is completely determined by the values $\mu_n =
1.814$, $\kappa_n = 10$, $\sigma_n^2 = 0.015$, $\nu_n = 10$, and can be expressed as

$$\begin{aligned}
\{\theta|y_1,\ldots,y_n,\sigma^2\} &\sim \text{normal}(1.814, \sigma^2/10), \\
\{1/\sigma^2|y_1,\ldots,y_n\} &\sim \text{gamma}(10/2, 10 \times 0.015/2).
\end{aligned}$$

Letting $\tilde{\sigma}^2 = 1/\sigma^2$, a contour plot of the bivariate posterior density of $(\theta, \tilde{\sigma}^2)$
appears in the first panel of Figure 5.4. This plot was obtained by computing
$\text{dnorm}(\theta_k, \mu_n, 1/\sqrt{10\tilde{\sigma}_l^2}) \times \text{dgamma}(\tilde{\sigma}_l^2, 10/2, 10\sigma_n^2/2)$ for each pair of values
$(\theta_k, \tilde{\sigma}_l^2)$ on a grid. Similarly, the second panel plots the joint posterior density
of $(\theta, \sigma^2)$. Notice that the contours are more peaked as a function of $\theta$ for low
values of $\sigma^2$ than high values.

Fig. 5.4. Joint posterior distributions of $(\theta, \tilde{\sigma}^2)$ and $(\theta, \sigma^2)$.


Monte Carlo sampling 


For many data analyses, interest primarily lies in estimating the population
mean $\theta$, and so we would like to calculate quantities like $\text{E}[\theta|y_1,\ldots,y_n]$,
$\text{sd}[\theta|y_1,\ldots,y_n]$, $\Pr(\theta_1 < \theta_2|y_{1,1},\ldots,y_{n_2,2})$, and so on. These quantities are
all determined by the marginal posterior distribution of $\theta$ given the data. But
all we know (so far) is that the conditional distribution of $\theta$ given the data
and $\sigma^2$ is normal, and that $\sigma^2$ given the data is inverse-gamma. If we could
generate marginal samples of $\theta$ from $p(\theta|y_1,\ldots,y_n)$, then we could use the
Monte Carlo method to approximate the above quantities of interest. It turns
out that this is quite easy to do by generating samples of $\theta$ and $\sigma^2$ from their
joint posterior distribution. Consider simulating parameter values using the
following Monte Carlo procedure:

$$\begin{aligned}
\sigma^{2(1)} &\sim \text{inverse-gamma}(\nu_n/2, \sigma_n^2\nu_n/2), & \theta^{(1)} &\sim \text{normal}(\mu_n, \sigma^{2(1)}/\kappa_n) \\
&\;\;\vdots & &\;\;\vdots \\
\sigma^{2(S)} &\sim \text{inverse-gamma}(\nu_n/2, \sigma_n^2\nu_n/2), & \theta^{(S)} &\sim \text{normal}(\mu_n, \sigma^{2(S)}/\kappa_n).
\end{aligned}$$

Note that each $\theta^{(s)}$ is sampled from its conditional distribution given the data
and $\sigma^2 = \sigma^{2(s)}$. This Monte Carlo procedure can be implemented in R with
only two lines of code:

s2.postsample <- 1/rgamma(10000, nun/2, s2n*nun/2 )
theta.postsample <- rnorm(10000, mun, sqrt(s2.postsample/kn) )


A sequence of pairs $\{(\sigma^{2(1)}, \theta^{(1)}),\ldots,(\sigma^{2(S)}, \theta^{(S)})\}$ simulated using this
procedure are independent samples from the joint posterior distribution
$p(\theta,\sigma^2|y_1,\ldots,y_n)$. Additionally, the simulated sequence $\{\theta^{(1)},\ldots,\theta^{(S)}\}$ can
be seen as independent samples from the marginal posterior distribution
$p(\theta|y_1,\ldots,y_n)$, and so we use this sequence to make Monte Carlo
approximations to functions involving $p(\theta|y_1,\ldots,y_n)$, as described in Chapter 4. It
may seem confusing that each $\theta^{(s)}$-value is referred to both as a sample from
the conditional posterior distribution of $\theta$ given $\sigma^2$ and as a sample from
the marginal posterior distribution of $\theta$ given only the data. To alleviate this
confusion, keep in mind that while $\theta^{(1)},\ldots,\theta^{(S)}$ are indeed each conditional
samples, they are each conditional on different values of $\sigma^2$. Taken together,
they constitute marginal samples of $\theta$.
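The procedure above can be sketched end to end for the midge data as follows. The posterior parameters are recomputed so that the block runs on its own; the seed, the summary names (`post.mean`, `post.ci`, `pr.below`) and the threshold 1.9 in the last line are illustrative assumptions.

```r
set.seed(1)

# data and prior parameters from the midge example
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
n <- length(y); ybar <- mean(y); s2 <- var(y)
mu0 <- 1.9; k0 <- 1; s20 <- 0.010; nu0 <- 1

# posterior parameters
kn <- k0 + n; nun <- nu0 + n
mun <- (k0*mu0 + n*ybar)/kn
s2n <- (nu0*s20 + (n - 1)*s2 + k0*n*(ybar - mu0)^2/kn)/nun

# joint posterior samples: sigma^2 first, then theta given each sigma^2
S <- 10000
s2.postsample    <- 1/rgamma(S, nun/2, s2n*nun/2)
theta.postsample <- rnorm(S, mun, sqrt(s2.postsample/kn))

# Monte Carlo approximations to marginal posterior quantities of theta
post.mean <- mean(theta.postsample)
post.ci   <- quantile(theta.postsample, c(.025, .975))
pr.below  <- mean(theta.postsample < 1.9)   # approximates Pr(theta < 1.9 | data)
```

Each summary uses only `theta.postsample`, which is what makes these marginal rather than conditional quantities.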

Figure 5.5 shows samples from the joint posterior distribution of $(\theta, \sigma^2)$,
as well as kernel density estimates of the marginal posterior distributions.
Any posterior quantities of interest can be approximated from these Monte
Carlo samples. For example, a 95% confidence interval can be obtained in R
with quantile(theta.postsample, c(.025,.975)), which gives an interval of (1.73,
1.90). This is extremely close to (1.70, 1.90), a frequentist 95% confidence
interval obtained from the t-test. There is a reason for this: It turns out that
$p(\theta|y_1,\ldots,y_n)$, the marginal posterior distribution of $\theta$, can be obtained in a
closed form. From this form, it can be shown that the posterior distribution
of $t(\theta) = \frac{\theta - \bar{y}}{s/\sqrt{n}}$, given $\bar{y}$ and $s^2$, has a t-distribution with $\nu_0 + n$ degrees of
freedom. If $\kappa_0$ and $\nu_0$ are small, then the posterior distribution of $t(\theta)$ will be
very close to the $t_{n-1}$ distribution. How small can $\kappa_0$ and $\nu_0$ be?
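The two intervals can also be compared without simulation. The sketch below assumes the standard result for this conjugate model that $(\theta - \mu_n)/\sqrt{\sigma_n^2/\kappa_n}$ has a t-distribution with $\nu_n$ degrees of freedom given the data; the names `bayes.ci` and `freq.ci` are illustrative.

```r
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
n <- length(y); ybar <- mean(y); s2 <- var(y)
mu0 <- 1.9; k0 <- 1; s20 <- 0.010; nu0 <- 1

kn <- k0 + n; nun <- nu0 + n
mun <- (k0*mu0 + n*ybar)/kn
s2n <- (nu0*s20 + (n - 1)*s2 + k0*n*(ybar - mu0)^2/kn)/nun

# closed-form 95% posterior interval (assumed t result with nun degrees of freedom)
bayes.ci <- mun + qt(c(.025, .975), df = nun) * sqrt(s2n/kn)

# frequentist 95% confidence interval from the one-sample t-test
freq.ci <- as.numeric(t.test(y)$conf.int)

round(rbind(bayes = bayes.ci, freq = freq.ci), 2)
```

The Bayesian interval is slightly narrower and shifted toward $\mu_0 = 1.9$, which is the effect of the one prior observation's worth of information.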


Improper priors 


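What if you want to "be Bayesian" so you can talk about things like $\Pr(\theta <
c|y_1,\ldots,y_n)$ but want to "be objective" by not using any prior information?
Since we have referred to $\kappa_0$ and $\nu_0$ as prior sample sizes, it might seem that
the smaller these parameters are, the more objective the estimates will be. So
it is natural to wonder what happens to the posterior distribution as $\kappa_0$ and
$\nu_0$ get smaller and smaller. The formulae for $\mu_n$ and $\sigma_n^2$ are

$$\begin{aligned}
\mu_n &= \frac{\kappa_0\mu_0 + n\bar{y}}{\kappa_0 + n} \\
\sigma_n^2 &= \frac{1}{\nu_0 + n}\left[\nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_n}(\bar{y} - \mu_0)^2\right],
\end{aligned}$$

and so as $\kappa_0, \nu_0 \to 0$ we have $\mu_n \to \bar{y}$ and $\sigma_n^2 \to \frac{n-1}{n}s^2 = \frac{1}{n}\sum(y_i - \bar{y})^2$.
This has led some to suggest the following "posterior distribution":

$$\begin{aligned}
\{1/\sigma^2|y_1,\ldots,y_n\} &\sim \text{gamma}\left(\frac{n}{2}, \frac{n}{2}\,\frac{1}{n}\sum(y_i - \bar{y})^2\right) \\
\{\theta|\sigma^2,y_1,\ldots,y_n\} &\sim \text{normal}\left(\bar{y}, \frac{\sigma^2}{n}\right).
\end{aligned}$$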


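Somewhat more formally, if we let $\tilde{p}(\theta,\sigma^2) = 1/\sigma^2$ (which is not a probability
density) and set $p(\theta,\sigma^2|y) \propto p(y|\theta,\sigma^2) \times \tilde{p}(\theta,\sigma^2)$, we get the same
"conditional distribution" for $\theta$ but a gamma$\left(\frac{n-1}{2}, \frac{1}{2}\sum(y_i - \bar{y})^2\right)$ distribution for
$1/\sigma^2$ (Gelman et al (2004), Chapter 3). You can integrate this latter joint
distribution over $\sigma^2$ to show that

$$\frac{\theta - \bar{y}}{s/\sqrt{n}}\,\Big|\,y_1,\ldots,y_n \sim t_{n-1}.$$

It is interesting to compare this result to the sampling distribution of the
t-statistic, conditional on $\theta$ but unconditional on the data:

$$\frac{\bar{Y} - \theta}{s/\sqrt{n}}\,\Big|\,\theta \sim t_{n-1}.$$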

The second statement says that, before you sample the data, your uncertainty
about the scaled deviation of the sample mean $\bar{Y}$ from the population mean $\theta$
is represented with a $t_{n-1}$ distribution. The first statement says that after you
sample your data, your uncertainty is still represented with a $t_{n-1}$ distribution.
The difference is that before you sample your data, both $\bar{Y}$ and $\theta$ are unknown.
After you sample your data, then $\bar{Y} = \bar{y}$ is known and this provides us with
information about $\theta$.
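As a rough numerical check of the $t_{n-1}$ result, one can sample from the improper-prior "posterior" in its gamma$\left(\frac{n-1}{2}, \frac{1}{2}\sum(y_i - \bar{y})^2\right)$ form and compare quantiles with the t-based interval. This is only a sketch using the midge data; the seed and the names `ss`, `mc.ci` and `t.ci` are assumptions made for illustration.

```r
set.seed(1)
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
n <- length(y); ybar <- mean(y)
ss <- sum((y - ybar)^2)                  # data sum of squares, (n-1)*s^2

# sample the precision, then theta given the precision
S <- 50000
prec  <- rgamma(S, (n - 1)/2, ss/2)
theta <- rnorm(S, ybar, sqrt(1/(n*prec)))

# Monte Carlo 95% interval versus the interval implied by the t_{n-1} result
mc.ci <- quantile(theta, c(.025, .975))
t.ci  <- ybar + qt(c(.025, .975), df = n - 1) * sqrt(var(y)/n)
round(rbind(mc = mc.ci, t = t.ci), 3)
```

Up to Monte Carlo error the two rows agree, and `t.ci` is the same interval that a one-sample t-test reports.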

There are no proper prior probability distributions on $(\theta, \sigma^2)$ that will lead
to the above $t_{n-1}$ posterior distribution for $\theta$, and so inference based on this
posterior distribution is not formally Bayesian. However, sometimes taking
limits like this leads to sensible answers: Theoretical results in Stein (1955)
show that from a decision-theoretic point of view, any reasonable estimator is
a Bayesian estimator or a limit of a sequence of Bayesian estimators, and that
any Bayesian estimator is reasonable (the technical term here is admissible;
see also Berger (1980)).


5.4 Bias, variance and mean squared error 


A point estimator of an unknown parameter $\theta$ is a function that converts
your data into a single element of the parameter space $\Theta$. For example, in the
case of a normal sampling model and conjugate prior distribution of the last
section, the posterior mean estimator of $\theta$ is

$$\hat{\theta}_b(y_1,\ldots,y_n) = \text{E}[\theta|y_1,\ldots,y_n] = \frac{n}{\kappa_0 + n}\bar{y} + \frac{\kappa_0}{\kappa_0 + n}\mu_0 = w\bar{y} + (1 - w)\mu_0.$$

Fig. 5.5. Monte Carlo samples from and estimates of the joint and marginal
distributions of the population mean and variance. The vertical lines in the third plot give
a 95% quantile-based posterior interval for $\theta$ (gray), as well as the 95% confidence
interval based on the t-statistic (black).

The sampling properties of an estimator such as $\hat{\theta}_b$ refer to its behavior under
hypothetically repeatable surveys or experiments. Let's compare the sampling
properties of $\hat{\theta}_b$ to $\hat{\theta}_e(y_1,\ldots,y_n) = \bar{y}$, the sample mean, when the true value
of the population mean is $\theta_0$:

$\text{E}[\hat{\theta}_e|\theta = \theta_0] = \theta_0$, and we say that $\hat{\theta}_e$ is "unbiased,"
$\text{E}[\hat{\theta}_b|\theta = \theta_0] = w\theta_0 + (1 - w)\mu_0$, and if $\mu_0 \neq \theta_0$ we say that $\hat{\theta}_b$ is "biased."
Bias refers to how close the center of mass of the sampling distribution of an
estimator is to the true value. An unbiased estimator is an estimator with
zero bias, which sounds desirable. However, bias does not tell us how far away
an estimate might be from the true value. For example, $y_1$ is an unbiased
estimator of the population mean $\theta_0$, but will generally be farther away from
$\theta_0$ than $\bar{y}$. To evaluate how close an estimator $\hat{\theta}$ is likely to be to the true
value $\theta_0$, we might use the mean squared error (MSE). Letting $m = \text{E}[\hat{\theta}|\theta_0]$,
the MSE is

$$\begin{aligned}
\text{MSE}[\hat{\theta}|\theta_0] &= \text{E}[(\hat{\theta} - \theta_0)^2|\theta_0] \\
&= \text{E}[(\hat{\theta} - m + m - \theta_0)^2|\theta_0] \\
&= \text{E}[(\hat{\theta} - m)^2|\theta_0] + 2\text{E}[(\hat{\theta} - m)(m - \theta_0)|\theta_0] + \text{E}[(m - \theta_0)^2|\theta_0].
\end{aligned}$$

Since $m = \text{E}[\hat{\theta}|\theta_0]$ it follows that $\text{E}[\hat{\theta} - m|\theta_0] = 0$ and so the second term is
zero. The first term is the variance of $\hat{\theta}$ and the third term is the square of
the bias and so

$$\text{MSE}[\hat{\theta}|\theta_0] = \text{Var}[\hat{\theta}|\theta_0] + \text{Bias}^2[\hat{\theta}|\theta_0].$$
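The decomposition can be seen numerically by simulating the sampling distribution of an estimator. The sketch below uses the shrinkage form $w\bar{y} + (1 - w)\mu_0$ of the posterior mean; the true values `theta0` and `sigma`, the prior settings and the number of replications are all hypothetical choices for illustration.

```r
set.seed(1)
theta0 <- 112; sigma <- 13     # hypothetical true mean and standard deviation
mu0 <- 100; k0 <- 1; n <- 10   # hypothetical prior guess, prior sample size, sample size
w <- n/(k0 + n)

# sampling distribution of the estimator over 5000 repeated datasets
theta.hat <- replicate(5000, w*mean(rnorm(n, theta0, sigma)) + (1 - w)*mu0)

m        <- mean(theta.hat)               # center of the sampling distribution
mse      <- mean((theta.hat - theta0)^2)  # average squared distance to the truth
variance <- mean((theta.hat - m)^2)       # spread around the center
bias2    <- (m - theta0)^2                # squared bias
c(mse = mse, var.plus.bias2 = variance + bias2)
```

With the empirical mean used for $m$, the two numbers agree up to floating-point error, because the cross term vanishes exactly as in the derivation.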

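This means that, before the data are gathered, the expected squared distance from
the estimator to the true value depends on how close $\theta_0$ is to the center of the
distribution of $\hat{\theta}$ (the bias), as well as how spread out the distribution is (the
variance). Getting back to our comparison of $\hat{\theta}_b$ to $\hat{\theta}_e$, the bias of $\hat{\theta}_e$ is zero,
but

$$\begin{aligned}
\text{Var}[\hat{\theta}_e|\theta = \theta_0,\sigma^2] &= \frac{\sigma^2}{n}, \text{ whereas} \\
\text{Var}[\hat{\theta}_b|\theta = \theta_0,\sigma^2] &= w^2 \times \frac{\sigma^2}{n} < \frac{\sigma^2}{n},
\end{aligned}$$

and so $\hat{\theta}_b$ has lower variability. Which one is better in terms of MSE?

$$\begin{aligned}
\text{MSE}[\hat{\theta}_e|\theta_0] &= \text{E}[(\hat{\theta}_e - \theta_0)^2|\theta_0] = \frac{\sigma^2}{n} \\
\text{MSE}[\hat{\theta}_b|\theta_0] &= \text{E}[(\hat{\theta}_b - \theta_0)^2|\theta_0] = \text{E}[\{w(\bar{y} - \theta_0) + (1 - w)(\mu_0 - \theta_0)\}^2|\theta_0] \\
&= w^2 \times \frac{\sigma^2}{n} + (1 - w)^2(\mu_0 - \theta_0)^2.
\end{aligned}$$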


With some algebra, you can show that $\text{MSE}[\hat{\theta}_b|\theta_0] < \text{MSE}[\hat{\theta}_e|\theta_0]$ if

$$(\mu_0 - \theta_0)^2 < \frac{\sigma^2}{n}\,\frac{1 + w}{1 - w} = \sigma^2\left(\frac{1}{n} + \frac{2}{\kappa_0}\right).$$

Some argue that if you know even just a little bit about the population you
are about to sample from, you should be able to find values of $\mu_0$ and $\kappa_0$
such that this inequality holds. In this case, you can construct a Bayesian
estimator that will have a lower average squared distance to the truth than
does the sample mean. For example, if you are pretty sure that your best
prior guess $\mu_0$ is within two standard deviations of the true population mean,
then if you pick $\kappa_0 = 1$ you can be pretty sure that the Bayes estimator has a
lower MSE. To make some of this more clear, let's take a look at the sampling
distributions of a few different estimators in the context of an example.


Example: IQ scores 


Scoring on IQ tests is designed to produce a normal distribution with a mean
of 100 and a standard deviation of 15 (a variance of 225) when applied to
the general population. Now suppose we are to sample $n$ individuals from a
particular town in the United States and then estimate $\theta$, the town-specific
mean IQ score, based on the sample of size $n$. For Bayesian estimation, if
we lack much information about the town in question, a natural choice of $\mu_0$
would be $\mu_0 = 100$.

Suppose that unknown to us the people in this town are extremely exceptional
and the true mean and standard deviation of IQ scores in the town are
$\theta = 112$ and $\sigma = 13$ ($\sigma^2 = 169$). The MSEs of the estimators $\hat{\theta}_e$ and $\hat{\theta}_b$ are
then

$$\begin{aligned}
\text{MSE}[\hat{\theta}_e|\theta_0] &= \text{Var}[\hat{\theta}_e] = \frac{\sigma^2}{n} = \frac{169}{n} \\
\text{MSE}[\hat{\theta}_b|\theta_0] &= w^2\,\frac{169}{n} + (1 - w)^2\,144,
\end{aligned}$$

where $w = n/(\kappa_0 + n)$. The ratio $\text{MSE}[\hat{\theta}_b|\theta_0]/\text{MSE}[\hat{\theta}_e|\theta_0]$ is plotted in the first
panel of Figure 5.6 as a function of $n$, for $\kappa_0 = 1$, 2 and 3.

Fig. 5.6. Mean squared errors and sampling distributions of different estimators of
the population mean IQ score.

Notice that when $\kappa_0 = 1$ or 2 the Bayes estimate has lower MSE than
the sample mean, especially when the sample size is low. This is because even
though the prior guess $\mu_0 = 100$ is seemingly way off, it is not actually that far
off when considering the uncertainty in our sample data. A choice of $\kappa_0 = 3$ on
the other hand puts more weight on the value of 100, and the corresponding
estimator has a generally higher MSE than the sample mean. As $n$ increases,
the bias of each of the estimators shrinks to zero, and the MSEs converge to
the common value of $\sigma^2/n$. The second panel of Figure 5.6 shows the sampling
distributions of the sample mean when $n = 10$, as well as those of the three
Bayes estimators corresponding to $\kappa_0 = 1$, 2 and 3. This plot highlights the
relative contributions of the bias and variance to the MSE. The sampling
distribution of the sample mean is centered around the true value of 112, but
is more spread out than any of the other distributions. The distribution of the
$\kappa_0 = 1$ estimator is not quite centered around the true mean, but its variance
is low and so this estimator is closer on average to the truth than the sample
mean.
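The MSE ratios for this example are easy to tabulate. The sketch below codes the two MSE formulas directly, and also evaluates the condition $(\mu_0 - \theta_0)^2 < \sigma^2(1/n + 2/\kappa_0)$, which follows from setting the Bayes MSE below $\sigma^2/n$ and substituting $w = n/(\kappa_0 + n)$. The function name `mse.ratio` and the range of $n$ are illustrative assumptions.

```r
theta0 <- 112; sigma2 <- 169; mu0 <- 100   # values from the IQ example

# ratio MSE[Bayes estimator] / MSE[sample mean] for sample size n and prior size k0
mse.ratio <- function(n, k0) {
  w <- n/(k0 + n)
  mse.b <- w^2 * sigma2/n + (1 - w)^2 * (mu0 - theta0)^2
  mse.e <- sigma2/n
  mse.b/mse.e
}

n <- 1:50
ratios <- sapply(1:3, function(k0) mse.ratio(n, k0))   # 50 x 3 matrix
colnames(ratios) <- paste0("k0=", 1:3)
round(ratios[c(1, 5, 10, 25, 50), ], 3)

# the same comparison expressed through the inequality on (mu0 - theta0)^2
wins <- sapply(1:3, function(k0) (mu0 - theta0)^2 < sigma2*(1/n + 2/k0))
```

For $\kappa_0 = 3$ the ratio drops below 1 only at the smallest sample sizes, which matches the description of that estimator as generally worse than the sample mean.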


5.5 Prior specification based on expectations 


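A p-dimensional exponential family model is a model whose densities can be
written as $p(y|\phi) = h(y)c(\phi)\exp\{\phi^T t(y)\}$, where $\phi$ is the parameter to be
estimated and $t(y) = \{t_1(y),\ldots,t_p(y)\}$ is the sufficient statistic. The normal
model is a two-dimensional exponential family model, where

• $t(y) = (y, y^2)$,
• $\phi = (\theta/\sigma^2, -(2\sigma^2)^{-1})$ and
• $c(\phi) = |\phi_2|^{1/2}\exp\{\phi_1^2/(4\phi_2)\}$.

As was the case for one-parameter exponential family models in Section
3.3, a conjugate prior distribution can be written in terms of $\phi$, giving
$p(\phi|n_0, t_0) \propto c(\phi)^{n_0}\exp(n_0 t_0^T\phi)$, where $t_0 = (t_{01}, t_{02}) = (\text{E}[Y], \text{E}[Y^2])$, the
prior expectations of $Y$ and $Y^2$. If we reparameterize in terms of $(\theta, \sigma^2)$, we
get

$$p(\theta,\sigma^2|n_0,t_0) \propto \left\{(\sigma^2)^{-1/2}\exp\left[-\frac{n_0(\theta - t_{01})^2}{2\sigma^2}\right]\right\} \times \left\{(\sigma^2)^{-(n_0+5)/2}\exp\left[-\frac{n_0(t_{02} - t_{01}^2)}{2\sigma^2}\right]\right\}.$$

The first term in the big braces is proportional to a normal$(t_{01}, \sigma^2/n_0)$ density,
and the second is proportional to an inverse-gamma$((n_0 + 3)/2, n_0(t_{02} - t_{01}^2)/2)$
density. To see how our prior parameters $t_{01}$ and $t_{02}$ should be determined,
let's consider the case where we have a prior expectation $\mu_0$ for the population
mean (so $\text{E}[Y] = \text{E}[\text{E}[Y|\theta]] = \text{E}[\theta] = \mu_0$), and a prior expectation $\sigma_0^2$ for the
population variance (so that $\text{E}[\text{Var}[Y|\theta,\sigma^2]] = \text{E}[\sigma^2] = \sigma_0^2$). Then we would
set $t_{01}$ equal to $\mu_0$, and determine $t_{02}$ from

$$\begin{aligned}
t_{02} = \text{E}[Y^2] &= \text{E}[\text{E}[Y^2|\theta,\sigma^2]] \\
&= \text{E}[\sigma^2 + \theta^2] \\
&= \sigma_0^2 + \sigma_0^2/n_0 + \mu_0^2 = \sigma_0^2(n_0 + 1)/n_0 + \mu_0^2,
\end{aligned}$$

so $n_0(t_{02} - t_{01}^2) = (n_0 + 1)\sigma_0^2$. Thus our joint prior distribution for $(\theta, \sigma^2)$
would be

$$\begin{aligned}
\theta|\sigma^2 &\sim \text{normal}(\mu_0, \sigma^2/n_0), \text{ and} \\
\sigma^2 &\sim \text{inverse-gamma}((n_0 + 3)/2, (n_0 + 1)\sigma_0^2/2).
\end{aligned}$$

For example, if our prior information is weak we might set $n_0 = 1$, giving
$\theta|\sigma^2 \sim \text{normal}(\mu_0, \sigma^2)$ and $\sigma^2 \sim \text{inverse-gamma}(2, \sigma_0^2)$. It is easy to check
that under this prior distribution the prior expectation of $Y$ is $\mu_0$, and the
prior expectation of $\text{Var}[Y|\theta,\sigma^2]$ is $\sigma_0^2$, as desired. Given $n$ i.i.d. samples from
the population, our posterior distribution under this prior would be

$$\begin{aligned}
\{\theta|\sigma^2,y_1,\ldots,y_n\} &\sim \text{normal}\left(\frac{\mu_0 + n\bar{y}}{1 + n}, \frac{\sigma^2}{n + 1}\right) \\
\{\sigma^2|y_1,\ldots,y_n\} &\sim \text{inverse-gamma}\left(2 + \frac{n}{2},\ \sigma_0^2 + \frac{(n-1)s^2 + \frac{n}{n+1}(\bar{y} - \mu_0)^2}{2}\right).
\end{aligned}$$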
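The claim that this prior reproduces the intended expectations can be checked by simulating from it. The sketch below uses $n_0 = 5$ rather than $n_0 = 1$ so that the inverse-gamma prior has a finite variance and the Monte Carlo averages settle quickly; the values of `mu0`, `s20`, `n0` and the sample size `S` are hypothetical.

```r
set.seed(1)
mu0 <- 1.9; s20 <- 0.01; n0 <- 5   # hypothetical prior expectations and prior sample size
S <- 200000

# draw (sigma^2, theta) from the joint prior, then one Y from the sampling model
sigma2 <- 1/rgamma(S, (n0 + 3)/2, (n0 + 1)*s20/2)
theta  <- rnorm(S, mu0, sqrt(sigma2/n0))
y      <- rnorm(S, theta, sqrt(sigma2))

t02 <- s20*(n0 + 1)/n0 + mu0^2     # implied prior expectation of Y^2
c(E.Y = mean(y), E.sigma2 = mean(sigma2), E.Y2 = mean(y^2), t02 = t02)
```

The averages of `y`, `sigma2` and `y^2` should land close to $\mu_0$, $\sigma_0^2$ and $t_{02}$ respectively.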


5.6 The normal model for non-normal data 


People use the normal model all the time in situations where the data are not 
even close to being normally distributed. The justification of this is generally 
that while the sampling distribution of a single data point is not normal, the 
sampling distribution of the sample mean is close to normal. Let’s explore this 
distinction via a Monte Carlo sampling experiment: The 1998 General Social 
Survey (GSS) recorded the number of children for 921 women over the age of 
40. Let’s take these 921 women as our population, and consider estimating the 
mean number of children for this population (which is 2.42) based on random 
samples Y1,..., Yn of different sizes n. 

The true population distribution is plotted in the first panel of Figure 5.7, 
and is clearly not normal. For example, the distribution is discrete, bounded 
and skewed, whereas a normal distribution is continuous, unbounded and 
symmetric. Now let's consider the sampling distribution of the sample mean
$\bar{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i$ for $n \in \{5, 15, 45\}$. This can be done using a Monte Carlo
approximation as follows: For each $n$ and some large value of $S$, simulate
$\bar{y}_n^{(1)},\ldots,\bar{y}_n^{(S)}$, where each $\bar{y}_n^{(s)}$ is the sample mean of $n$ samples taken without
replacement from the 921 values in the population. The second panel of Figure
5.7 shows the Monte Carlo approximations to the sampling distributions of
$\bar{Y}_5$, $\bar{Y}_{15}$ and $\bar{Y}_{45}$. While the distribution of $\bar{Y}_5$ looks a bit skewed, the sampling
distributions of $\bar{Y}_{15}$ and $\bar{Y}_{45}$ are hard to distinguish from normal distributions.

Fig. 5.7. A non-normal distribution and the distribution of its sample mean for
$n \in \{5, 15, 45\}$. The third panel shows a contour plot of the joint sampling density
of $\{\bar{y}, s^2\}$ for the case $n = 45$.

This should not be too much of a surprise, as the central limit theorem tells
us that

$$p(\bar{y}|\theta,\sigma^2) \approx \text{dnorm}(\bar{y}, \theta, \sqrt{\sigma^2/n}),$$

with the approximation becoming increasingly good as $n$ gets larger. If the
population variance $\sigma^2$ were known, then an approximate posterior distribution
of the population mean, conditional on the sample mean, could be
obtained as

$$\begin{aligned}
p(\theta|\bar{y},\sigma^2) &\propto p(\theta) \times p(\bar{y}|\theta,\sigma^2) \\
&\approx p(\theta) \times \text{dnorm}(\bar{y}, \theta, \sqrt{\sigma^2/n}).
\end{aligned}$$
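The Monte Carlo experiment described above can be sketched as follows. The GSS values are not reproduced in the text, so `pop` below is a hypothetical skewed population of 921 counts standing in for the real data; the `skew` helper and the choice $S = 10000$ are also assumptions.

```r
set.seed(1)

# Hypothetical population of 921 discrete, right-skewed counts (NOT the GSS data)
pop <- rep(0:8, times = c(110, 140, 280, 190, 100, 50, 25, 16, 10))

S  <- 10000
ns <- c(5, 15, 45)

# for each n, S sample means of n values drawn without replacement
ybar.sim <- sapply(ns, function(n) replicate(S, mean(sample(pop, n))))
colnames(ybar.sim) <- paste0("n=", ns)

# simple moment-based skewness, to see how fast the sample mean becomes symmetric
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

rbind(mean = colMeans(ybar.sim),
      var  = apply(ybar.sim, 2, var),
      skew = apply(ybar.sim, 2, skew))
```

In runs like this the skewness of the sample mean shrinks quickly as $n$ grows, which is the behavior the central limit theorem describes.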
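Of course $\sigma^2$ is generally not known, but it is estimated by $s^2$. The
approximate posterior distribution of $(\theta, \sigma^2)$ conditional on the estimates $(\bar{y}, s^2)$ is
given by

$$\begin{aligned}
p(\theta,\sigma^2|\bar{y},s^2) &\propto p(\theta,\sigma^2) \times p(\bar{y},s^2|\theta,\sigma^2) \\
&= p(\theta,\sigma^2) \times p(\bar{y}|\theta,\sigma^2) \times p(s^2|\bar{y},\theta,\sigma^2) \\
&\approx p(\theta,\sigma^2) \times \text{dnorm}(\bar{y}, \theta, \sqrt{\sigma^2/n}) \times p(s^2|\bar{y},\theta,\sigma^2). \qquad (5.2)
\end{aligned}$$

Again, for large $n$, the approximation of $p(\bar{y}|\theta,\sigma^2)$ by the normal density
is generally a good one even if the population is not normally distributed.
However, it is not clear what to put for $p(s^2|\bar{y},\theta,\sigma^2)$. If we knew that the
data were actually sampled from a normal distribution, then results from
statistical theory would say that

$$p(s^2|\bar{y},\theta,\sigma^2) = \text{dgamma}\left(s^2, \frac{n-1}{2}, \frac{n-1}{2\sigma^2}\right).$$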

Note that this result says that, for normal populations, $\bar{Y}$ and $s^2$ are
independent. Using this sampling model for $s^2$ in Equation 5.2 results in exactly
the same conditional distribution for $(\theta, \sigma^2)$ as $p(\theta,\sigma^2|y_1,\ldots,y_n)$ assuming
that the data are normally distributed. However, if the data are not normally
distributed, then $s^2$ is not necessarily gamma-distributed or independent of
$\bar{y}$. For example, the third panel of Figure 5.7 shows the joint sampling
distribution of $\{\bar{Y}, s^2\}$ for the GSS population. Notice that $\bar{Y}$ and $s^2$ are positively
correlated for this population, as is often the case for positively skewed
populations. This suggests that the use of the posterior distribution in Equation
5.2 for non-normal data could give misleading results about the joint
distribution of $\{\theta, \sigma^2\}$. However, the marginal posterior distribution of $\theta$ based on
5.2 can be remarkably accurate, even for non-normal data. The reasoning is
as follows: The central limit theorem says that for large $n$

$$\sqrt{n}\,\frac{\bar{Y} - \theta}{\sigma} \stackrel{\cdot}{\sim} \text{normal}(0, 1),$$

where $\stackrel{\cdot}{\sim}$ means "approximately distributed as." Additionally, if $n$ is
sufficiently large, then $s^2 \approx \sigma^2$ and so

$$\sqrt{n}\,\frac{\bar{Y} - \theta}{s} \stackrel{\cdot}{\sim} \text{normal}(0, 1).$$

This should seem familiar: Recall from introductory statistics that for normal
data, $\sqrt{n}(\bar{Y} - \theta)/s$ has a t-distribution with $n - 1$ degrees of freedom. For
large $n$, $s^2$ is very close to $\sigma^2$ and the $t_{n-1}$ distribution is very close to a
normal(0, 1) distribution.

Even though the posterior distribution based on a normal model may
provide good inference for the population mean, the normal model can provide
misleading results for other sample quantities. For example, every normal
density is symmetric and has a skew of zero, whereas our true population in the
above example has a skew of $\text{E}[(Y - \theta)^3]/\sigma^3 = 0.89$. Normal-model inference
for samples from this population will underestimate the number of people in
the right tail of the distribution, and so will provide poor estimates of the
percentage of people with large numbers of children. In general, using the normal
model for non-normal data is reasonable if we are only interested in obtaining
a posterior distribution for the population mean. For other population
quantities the normal model can provide misleading results.



5.7 Discussion and further references 


The normal sampling model can be justified in many different ways. For
example, Lukacs (1942) shows that a characterizing feature of the normal
distribution is that the sample mean and the sample variance are independent (see
also Rao (1958)). From a subjective probability perspective, this suggests that
if your beliefs about the sample mean are independent from those about the
sample variance, then a normal model is appropriate. Also, among all
distributions with a given mean $\theta$ and variance $\sigma^2$, the normal$(\theta, \sigma^2)$ distribution
is the most diffuse in terms of a measure known as entropy (see Jaynes, 2003,
Chap. 7, Chap. 11).

From a data analysis perspective, one justification of the normal sampling 
model is that, as described in Section 5.6, the sample mean will generally be 
approximately normally distributed due to the central limit theorem. Thus 
the normal model provides a reasonable sampling model for the sample mean, 
if not the sample data. Additionally, the normal model is a simple exponential 
family model with sufficient statistics equivalent to the sample mean and vari- 
ance. As a result, it will provide consistent estimation of the population mean 
and variance even if the underlying population is not normal. Additionally, 
confidence intervals for the population mean based on the normal model will 
generally be asymptotically correct (these results can be derived from those 
in White (1982)). However, the normal model may give inaccurate inference 
for other population quantities. 


6 Posterior approximation with the Gibbs sampler


For many multiparameter models the joint posterior distribution is non- 
standard and difficult to sample from directly. However, it is often the case 
that it is easy to sample from the full conditional distribution of each pa- 
rameter. In such cases, posterior approximation can be made with the Gibbs 
sampler, an iterative algorithm that constructs a dependent sequence of pa- 
rameter values whose distribution converges to the target joint posterior dis- 
tribution. In this chapter we outline the Gibbs sampler in the context of the 
normal model with a semiconjugate prior distribution, and discuss how well 
the method is able to approximate the posterior distribution. 


6.1 A semiconjugate prior distribution 


In the previous chapter we modeled our uncertainty about $\theta$ as depending on
$\sigma^2$:

$$p(\theta|\sigma^2) = \text{dnorm}(\theta, \mu_0, \sigma/\sqrt{\kappa_0}).$$

This prior distribution relates the prior variance of $\theta$ to the sampling variance
of our data in such a way that $\mu_0$ can be thought of as $\kappa_0$ prior samples
from the population. In some situations this makes sense, but in others we
may want to specify our uncertainty about $\theta$ as being independent of $\sigma^2$,
so that $p(\theta, \sigma^2) = p(\theta) \times p(\sigma^2)$. One such joint distribution is the following
"semiconjugate" prior distribution:

$$\begin{aligned}
\theta &\sim \text{normal}(\mu_0, \tau_0^2) \\
1/\sigma^2 &\sim \text{gamma}(\nu_0/2, \nu_0\sigma_0^2/2).
\end{aligned}$$


If $\{Y_1,\ldots,Y_n|\theta,\sigma^2\} \sim$ i.i.d. normal$(\theta, \sigma^2)$, we showed in Section 5.2 that
$\{\theta|\sigma^2,y_1,\ldots,y_n\} \sim \text{normal}(\mu_n, \tau_n^2)$ with

$$\mu_n = \frac{\mu_0/\tau_0^2 + n\bar{y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2} \quad\text{and}\quad \tau_n^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}.$$

P.D. Hoff, A First Course in Bayesian Statistical Methods,
Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_6,
© Springer Science+Business Media, LLC 2009

In the conjugate case where $\tau_0^2$ was proportional to $\sigma^2$, we showed that
$p(\sigma^2|y_1,\ldots,y_n)$ was an inverse-gamma distribution, and that a Monte Carlo
sample of $\{\theta, \sigma^2\}$ from their joint posterior distribution could be obtained by
sampling

1. a value $\sigma^{2(s)}$ from $p(\sigma^2|y_1,\ldots,y_n)$, an inverse-gamma distribution, then
2. a value $\theta^{(s)}$ from $p(\theta|\sigma^{2(s)},y_1,\ldots,y_n)$, a normal distribution.

However, in the case where $\tau_0^2$ is not proportional to $\sigma^2$, the marginal density
of $1/\sigma^2$ is not a gamma distribution, or any other standard distribution from
which we can easily sample.


6.2 Discrete approximations 


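Letting $\tilde{\sigma}^2 = 1/\sigma^2$ be the precision, recall that the posterior distribution
of $\{\theta, \tilde{\sigma}^2\}$ is equal to the joint distribution of $\{\theta, \tilde{\sigma}^2, y_1,\ldots,y_n\}$, divided by
$p(y_1,\ldots,y_n)$, which does not depend on the parameters. Therefore the relative
posterior probabilities of one set of parameter values $\{\theta_1, \tilde{\sigma}_1^2\}$ to another
$\{\theta_2, \tilde{\sigma}_2^2\}$ is directly computable:

$$\begin{aligned}
\frac{p(\theta_1,\tilde{\sigma}_1^2|y_1,\ldots,y_n)}{p(\theta_2,\tilde{\sigma}_2^2|y_1,\ldots,y_n)} &= \frac{p(\theta_1,\tilde{\sigma}_1^2,y_1,\ldots,y_n)/p(y_1,\ldots,y_n)}{p(\theta_2,\tilde{\sigma}_2^2,y_1,\ldots,y_n)/p(y_1,\ldots,y_n)} \\
&= \frac{p(\theta_1,\tilde{\sigma}_1^2,y_1,\ldots,y_n)}{p(\theta_2,\tilde{\sigma}_2^2,y_1,\ldots,y_n)}.
\end{aligned}$$

The joint distribution is easy to compute as it was built out of standard prior
and sampling distributions:

$$\begin{aligned}
p(\theta,\tilde{\sigma}^2,y_1,\ldots,y_n) &= p(\theta,\tilde{\sigma}^2) \times p(y_1,\ldots,y_n|\theta,\tilde{\sigma}^2) \\
&= \text{dnorm}(\theta, \mu_0, \tau_0) \times \text{dgamma}(\tilde{\sigma}^2, \nu_0/2, \nu_0\sigma_0^2/2) \times \prod_{i=1}^n \text{dnorm}(y_i, \theta, 1/\sqrt{\tilde{\sigma}^2}).
\end{aligned}$$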
i=1 


A discrete approximation to the posterior distribution makes use of these
facts by constructing a posterior distribution over a grid of parameter values,
based on relative posterior probabilities. This is done by evaluating
$p(\theta,\tilde{\sigma}^2,y_1,\ldots,y_n)$ on a two-dimensional grid of values of $\{\theta, \tilde{\sigma}^2\}$. Letting
$\{\theta_1,\ldots,\theta_G\}$ and $\{\tilde{\sigma}_1^2,\ldots,\tilde{\sigma}_H^2\}$ be sequences of evenly spaced parameter values,
the discrete approximation to the posterior distribution assigns a posterior
probability to each pair $\{\theta_k, \tilde{\sigma}_l^2\}$ on the grid, given by

$$\begin{aligned}
p_D(\theta_k,\tilde{\sigma}_l^2|y_1,\ldots,y_n) &= \frac{p(\theta_k,\tilde{\sigma}_l^2|y_1,\ldots,y_n)}{\sum_{g=1}^G\sum_{h=1}^H p(\theta_g,\tilde{\sigma}_h^2|y_1,\ldots,y_n)} \\
&= \frac{p(\theta_k,\tilde{\sigma}_l^2,y_1,\ldots,y_n)/p(y_1,\ldots,y_n)}{\sum_{g=1}^G\sum_{h=1}^H p(\theta_g,\tilde{\sigma}_h^2,y_1,\ldots,y_n)/p(y_1,\ldots,y_n)} \\
&= \frac{p(\theta_k,\tilde{\sigma}_l^2,y_1,\ldots,y_n)}{\sum_{g=1}^G\sum_{h=1}^H p(\theta_g,\tilde{\sigma}_h^2,y_1,\ldots,y_n)}.
\end{aligned}$$

This is a real joint probability distribution for $\theta \in \{\theta_1,\ldots,\theta_G\}$ and $\tilde{\sigma}^2 \in
\{\tilde{\sigma}_1^2,\ldots,\tilde{\sigma}_H^2\}$, in the sense that it sums to 1. In fact, it is the actual posterior
distribution of $\{\theta, \tilde{\sigma}^2\}$ if the joint prior distribution for these parameters is
discrete on this grid.

Let's try the approximation for the midge data from the previous chapter.
Recall that our data were $\{n = 9, \bar{y} = 1.804, s^2 = 0.017\}$. The conjugate prior
distribution on $\theta$ and $\sigma^2$ of Chapter 5 required that the prior variance on $\theta$ be
$\sigma^2/\kappa_0$, i.e. proportional to the sampling variance. A small value of the
sampling variance then has the possibly undesirable effect of reducing the nominal
prior uncertainty for $\theta$. In contrast, the semiconjugate prior distribution frees
us from this constraint. Recall that we first suggested that the prior mean
and standard deviation of $\theta$ should be $\mu_0 = 1.9$ and $\tau_0 = .95$, as this would
put most of the prior mass on $\theta > 0$, which we know to be true. For $\sigma^2$, let's
use prior parameters of $\nu_0 = 1$ and $\sigma_0^2 = 0.01$.

The R-code below evaluates $p(\theta, \tilde\sigma^2 \mid y_1, \ldots, y_n)$ on a $100 \times 100$
grid of evenly spaced parameter values, with
$\theta \in \{1.505, 1.510, \ldots, 1.995, 2.00\}$ and
$\tilde\sigma^2 \in \{1.75, 3.5, \ldots, 173.25, 175.0\}$. The first panel of Figure 6.1 gives
the discrete approximation to the joint distribution of $\{\theta, \tilde\sigma^2\}$. Marginal
and conditional posterior distributions for $\theta$ and $\tilde\sigma^2$ can be obtained from
the approximation to the joint distribution with simple arithmetic. For example,

$$
p_D(\theta_k \mid y_1, \ldots, y_n) = \sum_{h=1}^{H} p_D(\theta_k, \tilde\sigma^2_h \mid y_1, \ldots, y_n).
$$

The resulting discrete approximations to the marginal posterior distributions
of $\theta$ and $\tilde\sigma^2$ are shown in the second and third panels of Figure 6.1.
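
# prior parameters as given in the text (t20 is the prior variance tau_0^2)
mu0<-1.9 ; t20<-0.95^2 ; s20<-.01 ; nu0<-1
# midge wing lengths from Chapter 5 (n = 9, mean 1.804, variance 0.017)
y<-c(1.64,1.70,1.72,1.74,1.82,1.82,1.82,1.90,2.08)

G<-100 ; H<-100

mean.grid<-seq(1.505,2.00,length=G)
prec.grid<-seq(1.75,175,length=H)
post.grid<-matrix(nrow=G,ncol=H)

# evaluate prior x likelihood at every (theta, precision) pair on the grid
for(g in 1:G) {
  for(h in 1:H) {
    post.grid[g,h]<-
      dnorm(mean.grid[g], mu0, sqrt(t20)) *
      dgamma(prec.grid[h], nu0/2, s20*nu0/2) *
      prod( dnorm(y, mean.grid[g], 1/sqrt(prec.grid[h])) )
  }
}

# normalize so that the grid probabilities sum to 1
post.grid<-post.grid/sum(post.grid)
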
Fig. 6.1. Joint and marginal posterior distributions based on a discrete
approximation.
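
The "simple arithmetic" for marginal and conditional distributions can be sketched as follows. This is only an illustration: the data vector is assumed to be the nine midge wing lengths from Chapter 5, the grid is filled with `outer` on the log scale rather than with the double loop, and the names `mean.post`, `prec.post` and `cond.theta` are introduced here for convenience.

```r
# Midge data (assumed: nine wing lengths from Chapter 5) and prior parameters
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
mu0 <- 1.9; t20 <- 0.95^2; s20 <- 0.01; nu0 <- 1

G <- 100; H <- 100
mean.grid <- seq(1.505, 2.00, length = G)
prec.grid <- seq(1.75, 175, length = H)

# Log of prior x likelihood at every grid point (rows: theta, columns: precision)
log.post <- outer(mean.grid, prec.grid, Vectorize(function(th, pr) {
  dnorm(th, mu0, sqrt(t20), log = TRUE) +
    dgamma(pr, nu0 / 2, s20 * nu0 / 2, log = TRUE) +
    sum(dnorm(y, th, 1 / sqrt(pr), log = TRUE))
}))
post.grid <- exp(log.post - max(log.post))
post.grid <- post.grid / sum(post.grid)

# Marginals: sum the joint probabilities over the other parameter
mean.post <- rowSums(post.grid)   # p_D(theta_k | y)
prec.post <- colSums(post.grid)   # p_D(precision_h | y)

# Conditional of theta given the precision at (arbitrary) grid index h = 50
cond.theta <- post.grid[, 50] / sum(post.grid[, 50])

# Posterior mean of theta under the discrete approximation
theta.mean <- sum(mean.grid * mean.post)
theta.mean
```

Working on the log scale and subtracting the maximum before exponentiating is a design choice that guards against underflow; for a data set this small the direct product used in the text behaves the same way.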


Evaluation of this two-parameter posterior distribution at 100 values of
each parameter required a grid of size $100 \times 100 = 100^2$. In general, to construct
a similarly fine approximation for a $p$-dimensional posterior distribution we
would need a $p$-dimensional grid containing $100^p$ posterior probabilities. This
means that discrete approximations will only be feasible for densities having
a small number of parameters.


6.3 Sampling from the conditional distributions 


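Suppose for the moment you knew the value of $\theta$. The conditional distribution
of $\tilde\sigma^2$ given $\theta$ and $\{y_1, \ldots, y_n\}$ is

$$
\begin{aligned}
p(\tilde\sigma^2 \mid \theta, y_1, \ldots, y_n)
  &\propto p(y_1, \ldots, y_n, \theta, \tilde\sigma^2) \\
  &= p(y_1, \ldots, y_n \mid \theta, \tilde\sigma^2)\, p(\theta \mid \tilde\sigma^2)\, p(\tilde\sigma^2).
\end{aligned}
$$

If $\theta$ and $\tilde\sigma^2$ are independent in the prior distribution, then
$p(\theta \mid \tilde\sigma^2) = p(\theta)$ and

$$
\begin{aligned}
p(\tilde\sigma^2 \mid \theta, y_1, \ldots, y_n)
  &\propto p(y_1, \ldots, y_n \mid \theta, \tilde\sigma^2)\, p(\tilde\sigma^2) \\
  &\propto \left( (\tilde\sigma^2)^{n/2} \exp\Big\{-\tilde\sigma^2 \sum_{i=1}^{n} (y_i - \theta)^2 / 2\Big\} \right) \times
           \left( (\tilde\sigma^2)^{\nu_0/2 - 1} \exp\{-\tilde\sigma^2 \nu_0 \sigma_0^2 / 2\} \right) \\
  &= (\tilde\sigma^2)^{(\nu_0 + n)/2 - 1} \times
     \exp\Big\{-\tilde\sigma^2 \times \Big[\nu_0 \sigma_0^2 + \sum_{i=1}^{n} (y_i - \theta)^2\Big] / 2\Big\}.
\end{aligned}
$$

This is the form of a gamma density, and so evidently
$\{\sigma^2 \mid \theta, y_1, \ldots, y_n\} \sim$ inverse-gamma$(\nu_n/2, \nu_n \sigma_n^2(\theta)/2)$, where

$$
\nu_n = \nu_0 + n, \qquad \sigma_n^2(\theta) = \frac{1}{\nu_n}\left[\nu_0 \sigma_0^2 + n s_n^2(\theta)\right],
$$

and $s_n^2(\theta) = \sum_{i=1}^{n} (y_i - \theta)^2 / n$, the unbiased estimate of $\sigma^2$ if $\theta$
were known. This means that we can easily sample directly from
$p(\sigma^2 \mid \theta, y_1, \ldots, y_n)$, as well as from $p(\theta \mid \sigma^2, y_1, \ldots, y_n)$ as shown
at the beginning of the chapter. However, we do not yet have a way to sample
directly from $p(\theta, \sigma^2 \mid y_1, \ldots, y_n)$. Can we use the full conditional
distributions to sample from the joint posterior distribution?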
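
As a rough check on the full conditional just derived, the sketch below draws precisions from the gamma distribution for one fixed value of $\theta$ and compares their Monte Carlo mean with the exact gamma mean $1/\sigma_n^2(\theta)$. The data vector is assumed to be the midge wing lengths from Chapter 5, and `theta <- 1.8` is an arbitrary illustrative value, not one used in the text.

```r
# Midge data (assumed: nine wing lengths from Chapter 5) and prior parameters
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
n <- length(y)
nu0 <- 1; s20 <- 0.01

theta <- 1.8   # a hypothetical "known" value of theta

# Parameters of the full conditional
nun <- nu0 + n
s2n.theta <- sum((y - theta)^2) / n                  # s_n^2(theta)
sig2n.theta <- (nu0 * s20 + n * s2n.theta) / nun     # sigma_n^2(theta)

# Precisions are gamma(nu_n/2, nu_n * sigma_n^2(theta)/2);
# their reciprocals are inverse-gamma draws of the variance
set.seed(1)
prec <- rgamma(5000, nun / 2, nun * sig2n.theta / 2)
sigma2 <- 1 / prec

# The gamma mean is shape/rate = 1/sigma_n^2(theta)
c(mc.mean = mean(prec), exact.mean = 1 / sig2n.theta)
```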

Suppose we were given $\sigma^{2(1)}$, a single sample from the marginal posterior
distribution $p(\sigma^2 \mid y_1, \ldots, y_n)$. Then we could sample

$$
\theta^{(1)} \sim p(\theta \mid \sigma^{2(1)}, y_1, \ldots, y_n)
$$

and $\{\theta^{(1)}, \sigma^{2(1)}\}$ would be a sample from the joint distribution of
$\{\theta, \sigma^2\}$. Additionally, $\theta^{(1)}$ can be considered a sample from the marginal
distribution $p(\theta \mid y_1, \ldots, y_n)$. From this $\theta$-value, we can generate

$$
\sigma^{2(2)} \sim p(\sigma^2 \mid \theta^{(1)}, y_1, \ldots, y_n).
$$

But since $\theta^{(1)}$ is a sample from the marginal distribution of $\theta$, and
$\sigma^{2(2)}$ is a sample from the conditional distribution of $\sigma^2$ given $\theta^{(1)}$,
then $\{\theta^{(1)}, \sigma^{2(2)}\}$ is also a sample from the joint distribution of
$\{\theta, \sigma^2\}$. This in turn means that $\sigma^{2(2)}$ is a sample from the marginal
distribution $p(\sigma^2 \mid y_1, \ldots, y_n)$, which then could be used to generate a new
sample $\theta^{(2)}$, and so on. It seems that the two conditional distributions could
be used to generate samples from the joint distribution, if only we had a
$\sigma^{2(1)}$ from which to start.


6.4 Gibbs sampling 


The distributions $p(\theta \mid \sigma^2, y_1, \ldots, y_n)$ and $p(\sigma^2 \mid \theta, y_1, \ldots, y_n)$ are
called the full conditional distributions of $\theta$ and $\sigma^2$ respectively, as they
are each a conditional distribution of a parameter given everything else. Let’s
make the iterative sampling idea described in the previous paragraph more
precise. Given a current state of the parameters
$\phi^{(s)} = \{\theta^{(s)}, \tilde\sigma^{2(s)}\}$, we generate a new state as follows:

1. sample $\theta^{(s+1)} \sim p(\theta \mid \tilde\sigma^{2(s)}, y_1, \ldots, y_n)$;
2. sample $\tilde\sigma^{2(s+1)} \sim p(\tilde\sigma^2 \mid \theta^{(s+1)}, y_1, \ldots, y_n)$;
3. let $\phi^{(s+1)} = \{\theta^{(s+1)}, \tilde\sigma^{2(s+1)}\}$.


This algorithm is called the Gibbs sampler, and generates a dependent
sequence of our parameters $\{\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(S)}\}$. The R-code to perform this
sampling scheme for the normal model with the semiconjugate prior
distribution is as follows:


### data
mean.y<-mean(y) ; var.y<-var(y) ; n<-length(y)
###

### starting values
S<-1000
PHI<-matrix(nrow=S,ncol=2)
PHI[1,]<-phi<-c(mean.y, 1/var.y)
###

### Gibbs sampling
set.seed(1)
for(s in 2:S) {

  # generate a new theta value from its full conditional
  mun<-( mu0/t20 + n*mean.y*phi[2] ) / ( 1/t20 + n*phi[2] )
  t2n<-1/( 1/t20 + n*phi[2] )
  phi[1]<-rnorm(1, mun, sqrt(t2n) )

  # generate a new 1/sigma^2 value from its full conditional
  nun<-nu0+n
  s2n<-( nu0*s20 + (n-1)*var.y + n*(mean.y-phi[1])^2 ) /nun
  phi[2]<-rgamma(1, nun/2, nun*s2n/2)

  # store the new state; column 1 is theta, column 2 is the precision
  PHI[s,]<-phi
}
###


In this code, we have used the identity

$$
n s_n^2(\theta) = \sum_{i=1}^{n} (y_i - \theta)^2 = (n - 1) s^2 + n(\bar{y} - \theta)^2.
$$

The reason for writing the code this way is that $s^2$ and $\bar{y}$ do not change
with each new $\theta$-value, and computing $(n - 1)s^2 + n(\bar{y} - \theta)^2$ is faster than
having to recompute $\sum_{i=1}^{n} (y_i - \theta)^2$ at each iteration.
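
The sum-of-squares shortcut is easy to confirm numerically. The sketch below uses simulated data and an arbitrary value of $\theta$ (both hypothetical, chosen only for illustration) and checks that the direct sum of squares agrees with the version built from the sample mean and variance.

```r
set.seed(1)
y <- rnorm(9, mean = 1.8, sd = 0.13)   # hypothetical data
n <- length(y); mean.y <- mean(y); var.y <- var(y)

theta <- 1.75                          # arbitrary value of theta

direct   <- sum((y - theta)^2)                          # recomputed from the data
shortcut <- (n - 1) * var.y + n * (mean.y - theta)^2    # uses only mean.y and var.y

all.equal(direct, shortcut)
```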

Fig. 6.2. The first 5, 15 and 100 iterations of a Gibbs sampler.


Using the midge data from the previous chapter and the prior distributions
described above, a Gibbs sampler consisting of 1,000 iterations was
constructed. Figure 6.2 plots the first 5, 15 and 100 simulated values, and the
first panel of Figure 6.3 plots the 1,000 values over the contours of the discrete
approximation to $p(\theta, \tilde\sigma^2 \mid y_1, \ldots, y_n)$. The second and third panels of
Figure 6.3 give density estimates of the distributions of the simulated values
of $\theta$ and $\tilde\sigma^2$. Finally, let’s find some empirical quantiles of our Gibbs samples:


### CI for population mean
> quantile(PHI[,1],c(.025,.5,.975))
    2.5%      50%    97.5%
1.707282 1.804348 1.901129

### CI for population precision
> quantile(PHI[,2],c(.025,.5,.975))
     2.5%       50%     97.5%
 17.48020  53.62511 129.20020

### CI for population standard deviation
> quantile(1/sqrt(PHI[,2]),c(.025,.5,.975))
      2.5%        50%      97.5%
0.08797701 0.13655763 0.23918408


The empirical distribution of these Gibbs samples very closely resembles the
discrete approximation to their posterior distribution, as can be seen by
comparing Figures 6.1 and 6.3. This gives some indication that the Gibbs sampling
procedure is a valid method for approximating $p(\theta, \tilde\sigma^2 \mid y_1, \ldots, y_n)$.

Fig. 6.3. The first panel shows 1,000 samples from the Gibbs sampler, plotted over
the contours of the discrete approximation. The second and third panels give kernel
density estimates to the distributions of Gibbs samples of $\theta$ and $\tilde\sigma^2$. Vertical gray
bars on the second plot indicate 2.5% and 97.5% quantiles of the Gibbs samples
of $\theta$, while nearly identical black vertical bars indicate the 95% confidence interval
based on the t-test.


6.5 General properties of the Gibbs sampler 


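Suppose you have a vector of parameters $\phi = \{\phi_1, \ldots, \phi_p\}$, and your
information about $\phi$ is measured with $p(\phi) = p(\phi_1, \ldots, \phi_p)$. For example, in
the normal model $\phi = \{\theta, \sigma^2\}$, and the probability measure of interest is
$p(\theta, \sigma^2 \mid y_1, \ldots, y_n)$. Given a starting point
$\phi^{(0)} = \{\phi_1^{(0)}, \ldots, \phi_p^{(0)}\}$, the Gibbs sampler generates $\phi^{(s)}$ from
$\phi^{(s-1)}$ as follows:


1. sample $\phi_1^{(s)} \sim p(\phi_1 \mid \phi_2^{(s-1)}, \phi_3^{(s-1)}, \ldots, \phi_p^{(s-1)})$
2. sample $\phi_2^{(s)} \sim p(\phi_2 \mid \phi_1^{(s)}, \phi_3^{(s-1)}, \ldots, \phi_p^{(s-1)})$
   $\vdots$
p. sample $\phi_p^{(s)} \sim p(\phi_p \mid \phi_1^{(s)}, \phi_2^{(s)}, \ldots, \phi_{p-1}^{(s)})$.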
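
The $p$ steps above can be sketched as a small generic function. The function `gibbs` and the `full.cond` list are hypothetical helpers introduced only for illustration; the usage example takes a bivariate normal target with correlation `rho`, whose full conditionals are the standard univariate normal conditionals.

```r
# Generic Gibbs sampler (illustrative helper, not from the text).
# full.cond is a list of p functions; full.cond[[j]](phi) returns one draw
# of phi[j] from its full conditional given the current values in phi.
gibbs <- function(phi0, full.cond, S) {
  p <- length(phi0)
  PHI <- matrix(NA_real_, nrow = S, ncol = p)
  phi <- phi0
  for (s in 1:S) {
    for (j in 1:p) {
      # components before j already hold iteration-s values,
      # components after j still hold iteration-(s-1) values
      phi[j] <- full.cond[[j]](phi)
    }
    PHI[s, ] <- phi
  }
  PHI
}

# Usage: bivariate normal target, zero means, unit variances, correlation rho
rho <- 0.5
full.cond <- list(
  function(phi) rnorm(1, rho * phi[2], sqrt(1 - rho^2)),
  function(phi) rnorm(1, rho * phi[1], sqrt(1 - rho^2))
)
set.seed(1)
PHI <- gibbs(c(0, 0), full.cond, S = 5000)
colMeans(PHI)              # both should be near 0
cor(PHI[, 1], PHI[, 2])    # should be near rho
```

Updating `phi` in place is what makes step $j$ condition on the newest values of components $1, \ldots, j-1$ and the previous values of components $j+1, \ldots, p$.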


This algorithm generates a dependent sequence of vectors:

$$
\begin{aligned}
\phi^{(1)} &= \{\phi_1^{(1)}, \ldots, \phi_p^{(1)}\} \\
\phi^{(2)} &= \{\phi_1^{(2)}, \ldots, \phi_p^{(2)}\} \\
           &\;\;\vdots \\
\phi^{(S)} &= \{\phi_1^{(S)}, \ldots, \phi_p^{(S)}\}.
\end{aligned}
$$

In this sequence, $\phi^{(s)}$ depends on $\phi^{(0)}, \ldots, \phi^{(s-1)}$ only through
$\phi^{(s-1)}$, i.e., $\phi^{(s)}$ is conditionally independent of
$\phi^{(0)}, \ldots, \phi^{(s-2)}$ given $\phi^{(s-1)}$. This is called the Markov property, and
so the sequence is called a Markov chain. Under some conditions that will be
met for all of the models discussed in this text,

$$
\Pr(\phi^{(s)} \in A) \to \int_A p(\phi)\, d\phi \quad \text{as } s \to \infty.
$$

In words, the sampling distribution of $\phi^{(s)}$ approaches the target distribution
as $s \to \infty$, no matter what the starting value $\phi^{(0)}$ is (although some starting
values will get you to the target sooner than others). More importantly, for
most functions $g$ of interest,

$$
\frac{1}{S} \sum_{s=1}^{S} g(\phi^{(s)}) \to E[g(\phi)] = \int g(\phi)\, p(\phi)\, d\phi \quad \text{as } S \to \infty. \tag{6.1}
$$


This means we can approximate $E[g(\phi)]$ with the sample average of
$\{g(\phi^{(1)}), \ldots, g(\phi^{(S)})\}$, just as in Monte Carlo approximation. For this
reason, we call such approximations Markov chain Monte Carlo (MCMC)
approximations, and the procedure an MCMC algorithm. In the context of the
semiconjugate normal model, Equation 6.1 implies that the joint distribution of
$\{(\theta^{(1)}, \sigma^{2(1)}), \ldots, (\theta^{(1000)}, \sigma^{2(1000)})\}$ is approximately equal to
$p(\theta, \sigma^2 \mid y_1, \ldots, y_n)$, and that

$$
E[\theta \mid y_1, \ldots, y_n] \approx \frac{1}{1000} \sum_{s=1}^{1000} \theta^{(s)} = 1.804, \quad \text{and}
$$

$$
\Pr(\theta \in [1.71, 1.90] \mid y_1, \ldots, y_n) \approx 0.95.
$$


We will discuss practical aspects of MCMC in the context of specific models 
in the next section and in the next several chapters. 
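
The two MCMC approximations quoted above are just sample averages over the chain. The sketch below reruns the sampler of Section 6.4 and computes them; the data vector is assumed to be the midge wing lengths from Chapter 5, and the names `theta.hat` and `pr.hat` are introduced here for illustration.

```r
# Midge data (assumed: nine wing lengths from Chapter 5) and prior parameters
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
mu0 <- 1.9; t20 <- 0.95^2; s20 <- 0.01; nu0 <- 1
n <- length(y); mean.y <- mean(y); var.y <- var(y)

# Gibbs sampler for (theta, precision), as in Section 6.4
S <- 1000
PHI <- matrix(nrow = S, ncol = 2)
PHI[1, ] <- phi <- c(mean.y, 1 / var.y)
set.seed(1)
for (s in 2:S) {
  t2n <- 1 / (1 / t20 + n * phi[2])
  mun <- (mu0 / t20 + n * mean.y * phi[2]) * t2n
  phi[1] <- rnorm(1, mun, sqrt(t2n))

  nun <- nu0 + n
  s2n <- (nu0 * s20 + (n - 1) * var.y + n * (mean.y - phi[1])^2) / nun
  phi[2] <- rgamma(1, nun / 2, nun * s2n / 2)

  PHI[s, ] <- phi
}

# g(phi) = theta: approximates E[theta | y]
theta.hat <- mean(PHI[, 1])

# g(phi) = indicator(1.71 <= theta <= 1.90): approximates the interval probability
pr.hat <- mean(PHI[, 1] >= 1.71 & PHI[, 1] <= 1.90)

c(theta.hat = theta.hat, pr.hat = pr.hat)
```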


Distinguishing parameter estimation from posterior approximation 


A Bayesian data analysis using Monte Carlo methods often involves a confus- 
ing array of sampling procedures and probability distributions. With this in 
mind it is helpful to distinguish the part of the data analysis which is statisti- 
cal from that which is numerical approximation. Recall from Chapter 1 that 
the necessary ingredients of a Bayesian data analysis are 


1. Model specification: a collection of probability distributions
   $\{p(y \mid \phi), \phi \in \Phi\}$ which should represent the sampling distribution of
   your data for some value of $\phi \in \Phi$;

2. Prior specification: a probability distribution $p(\phi)$, ideally representing
   someone’s prior information about which parameter values are likely to
   describe the sampling distribution.


Once these items are specified and the data have been gathered, the posterior
$p(\phi \mid y)$ is completely determined. It is given by

$$
p(\phi \mid y) = \frac{p(\phi)\, p(y \mid \phi)}{p(y)} = \frac{p(\phi)\, p(y \mid \phi)}{\int p(\phi)\, p(y \mid \phi)\, d\phi},
$$

and so in a sense there is no more modeling or estimation. All that is left is


3. Posterior summary: a description of the posterior distribution $p(\phi \mid y)$, done
   in terms of particular quantities of interest such as posterior means,
   medians, modes, predictive probabilities and confidence regions.


For many models, $p(\phi \mid y)$ is complicated, hard to write down, and so on. In
these cases, a useful way to “look at” $p(\phi \mid y)$ is by studying Monte Carlo
samples from $p(\phi \mid y)$. Thus, Monte Carlo and MCMC sampling algorithms


• are not models,
• they do not generate “more information” than is in $y$ and $p(\phi)$,
• they are simply “ways of looking at” $p(\phi \mid y)$.


For example, if we have Monte Carlo samples $\phi^{(1)}, \ldots, \phi^{(S)}$ that are
approximate draws from $p(\phi \mid y)$, then these samples help describe $p(\phi \mid y)$:

$$
\begin{aligned}
\frac{1}{S} \sum_{s=1}^{S} \phi^{(s)} &\approx \int \phi\, p(\phi \mid y)\, d\phi \\
\frac{1}{S} \sum_{s=1}^{S} 1(\phi^{(s)} \le c) &\approx \Pr(\phi \le c \mid y) = \int_{-\infty}^{c} p(\phi \mid y)\, d\phi,
\end{aligned}
$$

and so on. To keep this distinction in mind, it is useful to reserve the word
estimation to describe how we use $p(\phi \mid y)$ to make inference about $\phi$, and to
use the word approximation to describe the use of Monte Carlo procedures to
approximate integrals.


6.6 Introduction to MCMC diagnostics 


The purpose of Monte Carlo or Markov chain Monte Carlo approximation is
to obtain a sequence of parameter values $\{\phi^{(1)}, \ldots, \phi^{(S)}\}$ such that

$$
\frac{1}{S} \sum_{s=1}^{S} g(\phi^{(s)}) \approx \int g(\phi)\, p(\phi)\, d\phi,
$$

for any functions $g$ of interest. In other words, we want the empirical average
of $\{g(\phi^{(1)}), \ldots, g(\phi^{(S)})\}$ to approximate the expected value of $g(\phi)$ under a
target probability distribution $p(\phi)$ (in Bayesian inference, the target
distribution is usually the posterior distribution). In order for this to be a good
approximation for a wide range of functions $g$, we need the empirical
distribution of the simulated sequence $\{\phi^{(1)}, \ldots, \phi^{(S)}\}$ to look like the target
distribution $p(\phi)$. Monte Carlo and Markov chain Monte Carlo are two ways
of generating such a sequence. Monte Carlo simulation, in which we generate
independent samples from the target distribution, is in some sense the
“gold standard.” Independent MC samples automatically create a sequence
that is representative of $p(\phi)$: The probability that $\phi^{(s)} \in A$ for any set $A$ is
$\int_A p(\phi)\, d\phi$. This is true for every $s \in \{1, \ldots, S\}$ and conditionally or
unconditionally on the other values in the sequence. This is not true for MCMC
samples, in which case all we are sure of is that

$$
\lim_{s \to \infty} \Pr(\phi^{(s)} \in A) = \int_A p(\phi)\, d\phi.
$$


Let’s explore the differences between MC and MCMC with a simple
example. Our target distribution will be the joint probability distribution of
two variables: a discrete variable $\delta \in \{1, 2, 3\}$ and a continuous variable
$\theta \in \mathbb{R}$. The target density for this example will be defined as
$\{\Pr(\delta = 1), \Pr(\delta = 2), \Pr(\delta = 3)\} = (.45, .10, .45)$ and
$p(\theta \mid \delta) = \text{dnorm}(\theta, \mu_\delta, \sigma_\delta)$, where $(\mu_1, \mu_2, \mu_3) = (-3, 0, 3)$ and
$(\sigma_1^2, \sigma_2^2, \sigma_3^2) = (1/3, 1/3, 1/3)$. This is a mixture of three normal densities,
where we might think of $\delta$ as being a group membership variable and
$(\mu_\delta, \sigma_\delta^2)$ as the population mean and variance for group $\delta$. A plot of the
exact marginal density of $\theta$, $p(\theta) = \sum_\delta p(\theta \mid \delta)\, p(\delta)$, appears in the black
lines of Figure 6.4. Notice that there are three modes representing the three
different group means.


Fig. 6.4. A mixture of normal densities and a Monte Carlo approximation.


It is very easy to obtain independent Monte Carlo samples from the joint
distribution of $\phi = (\delta, \theta)$. First, a value of $\delta$ is sampled from its marginal
distribution, then the value is plugged into $p(\theta \mid \delta)$, from which a value of $\theta$
is sampled. The sampled pair $(\delta, \theta)$ represents a sample from the joint
distribution $p(\delta, \theta) = p(\delta)\, p(\theta \mid \delta)$. The empirical distribution of the
$\theta$-samples provides an approximation to the marginal distribution
$p(\theta) = \sum_\delta p(\theta \mid \delta)\, p(\delta)$. A histogram of 1,000 Monte Carlo $\theta$-values generated
in this way is shown in Figure 6.4. The empirical distribution of the Monte
Carlo samples looks a lot like $p(\theta)$.
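
A minimal sketch of this two-stage Monte Carlo scheme follows. The object names (`w`, `mu`, `s2`, `p.theta`) are chosen here for illustration; the mixture weights, means and variances are those given in the text.

```r
set.seed(1)
S <- 1000
w  <- c(.45, .10, .45)      # Pr(delta = 1), Pr(delta = 2), Pr(delta = 3)
mu <- c(-3, 0, 3)           # group means
s2 <- c(1/3, 1/3, 1/3)      # group variances

# Step 1: sample delta from its marginal distribution
delta <- sample(1:3, S, replace = TRUE, prob = w)

# Step 2: sample theta from p(theta | delta)
theta <- rnorm(S, mu[delta], sqrt(s2[delta]))

# Exact marginal density p(theta), for comparison with the histogram
p.theta <- function(x) sapply(x, function(xx) sum(w * dnorm(xx, mu, sqrt(s2))))

hist(theta, prob = TRUE, breaks = 40, main = "", xlab = "theta")
curve(p.theta(x), add = TRUE)

table(delta) / S            # empirical group frequencies, to compare with w
```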

It is also straightforward to construct a Gibbs sampler for $\phi = (\delta, \theta)$. A
Gibbs sampler would alternately sample values of $\theta$ and $\delta$ from their full
conditional distributions. The full conditional distribution of $\theta$ is already
provided, and using Bayes’ rule we can show that the full conditional
distribution of $\delta$ is given by

$$
\Pr(\delta = d \mid \theta) =
  \frac{\Pr(\delta = d) \times \text{dnorm}(\theta, \mu_d, \sigma_d)}
       {\sum_{d'=1}^{3} \Pr(\delta = d') \times \text{dnorm}(\theta, \mu_{d'}, \sigma_{d'})},
  \quad \text{for } d \in \{1, 2, 3\}.
$$

The first panel of Figure 6.5 shows a histogram of 1,000 MCMC values of $\theta$
generated with the Gibbs sampler. Notice that the empirical distribution of
the MCMC samples gives a poor approximation to $p(\theta)$. Values of $\theta$ near $-3$
are underrepresented, whereas values near zero and $+3$ are overrepresented.
What went wrong? A plot of the $\theta$-values versus iteration number in the second
panel of the figure tells the story. The $\theta$-values get “stuck” in certain regions,
and rarely move among the three regions represented by the three values of
$\mu$. The technical term for this “stickiness” is autocorrelation, or correlation
between consecutive values of the chain. In this Gibbs sampler, if we have a
value of $\theta$ near 0 for example, then the next value of $\delta$ is likely to be 2. If $\delta$ is
2, then the next value of $\theta$ is likely to be near 0, resulting in a high degree of
positive correlation between consecutive $\theta$-values in the chain.

Isn’t the Gibbs sampler guaranteed to eventually provide a good
approximation to $p(\theta)$? It is, but “eventually” can be a very long time in some
situations. The first panel of Figure 6.6 indicates that our approximation has
greatly improved after using 10,000 iterations of the Gibbs sampler, although
it is still somewhat inadequate.
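
A Gibbs sampler for this mixture can be sketched in a few lines. The starting value `theta <- 0` and the object names are assumptions made for illustration; the text does not say how its chain was initialized, so a run of this code will not reproduce Figures 6.5 and 6.6 exactly.

```r
set.seed(1)
S <- 1000
w  <- c(.45, .10, .45)      # Pr(delta = d)
mu <- c(-3, 0, 3)           # group means
s2 <- c(1/3, 1/3, 1/3)      # group variances

THETA <- numeric(S); DELTA <- integer(S)
theta <- 0                  # arbitrary starting value

for (s in 1:S) {
  # full conditional of delta given theta: Pr(delta = d) x dnorm(theta, mu_d, sigma_d),
  # normalized over d (sample() rescales prob to sum to 1)
  pd <- w * dnorm(theta, mu, sqrt(s2))
  delta <- sample(1:3, 1, prob = pd)

  # full conditional of theta given delta
  theta <- rnorm(1, mu[delta], sqrt(s2[delta]))

  THETA[s] <- theta; DELTA[s] <- delta
}

table(DELTA) / S                    # compare with (.45, .10, .45)
acf(THETA, plot = FALSE)$acf[2]     # lag-1 autocorrelation of the theta-values
plot(THETA, xlab = "iteration", ylab = "theta")   # traceplot
```

Because $\theta$ almost always stays within the mode picked out by the current $\delta$, and $\delta$ is almost always the group nearest the current $\theta$, the chain changes groups only rarely.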

Fig. 6.5. Histogram and traceplot of 1,000 Gibbs samples.


In the case of a generic parameter $\phi$ and target distribution $p(\phi)$, it is
helpful to think of the sequence $\{\phi^{(1)}, \ldots, \phi^{(S)}\}$ as the trajectory of a particle
$\phi$ moving around the parameter space. In terms of MCMC integral
approximation, the critical thing is that the amount of time the particle spends in a
given set $A$ is proportional to the target probability $\int_A p(\phi)\, d\phi$.


Fig. 6.6. Histogram and traceplot of 10,000 Gibbs samples.

Now suppose $A_1$, $A_2$ and $A_3$ are three disjoint subsets of the parameter
space, with $\Pr(A_2) < \Pr(A_1) \approx \Pr(A_3)$ (these could be, for example, the
regions near the three modes of the normal mixture distribution above). In
terms of the integral approximation, this means that we want the particle to
spend little time in $A_2$, and about the same amount of time in $A_1$ as in $A_3$.
Since in general we do not know $p(\phi)$ (otherwise we would not be trying to
approximate it), it is possible that we would accidentally start our Markov
chain in $A_2$. In this case, it is critical that the number of iterations $S$ is large
enough so that the particle has a chance to


1. move out of $A_2$ and into higher probability regions, and
2. move between $A_1$ and $A_3$, and any other sets of high probability.


The technical term for attaining item 1 is to say that the chain has achieved 
stationarity or has converged. If your Markov chain starts off in a region of 
the parameter space that has high probability, then convergence generally is 
not a big issue. If you do not know if you are starting off in a good region, 
assessing convergence is fraught with epistemological problems. In general, 
you cannot know for sure if your chain has converged. But sometimes you 
can know if your chain has not converged, so we at least check for this latter 
possibility. One thing to check for is stationarity, or that samples taken in 
one part of the chain have a similar distribution to samples taken in other 
parts. For the normal model with semiconjugate prior distributions from the 
previous section, stationarity is achieved quite quickly and is not a big issue. 
However, for some highly parameterized models that we will see later on, the 
autocorrelation in the chain is high, good starting values can be hard to find 
and it can take a long time to get to stationarity. In these cases we need to
run the MCMC sampler for a very long time. 

Item 2 above relates to how quickly the particle moves around the parameter
space, which is sometimes called the speed of mixing. An independent
MC sampler has perfect mixing: It has zero autocorrelation and can jump
between different regions of the parameter space in one step. As we have seen in
the example above, an MCMC sampler might have poor mixing, take a long
time between jumps to different parts of the parameter space and have a high
degree of autocorrelation. How does the correlation of the MCMC samples
affect posterior approximation? Suppose we want to approximate the integral
$E[\phi] = \int \phi\, p(\phi)\, d\phi = \phi_0$ using the empirical distribution of
$\{\phi^{(1)}, \ldots, \phi^{(S)}\}$. If the $\phi$-values are independent Monte Carlo samples from
$p(\phi)$, then the variance of $\bar\phi = \sum_{s} \phi^{(s)} / S$ is

$$
\text{Var}_{\text{MC}}[\bar\phi] = E[(\bar\phi - \phi_0)^2] = \frac{\text{Var}[\phi]}{S},
$$

where $\text{Var}[\phi] = \int \phi^2 p(\phi)\, d\phi - \phi_0^2$. Recall from Chapter 4 that the square
root of $\text{Var}_{\text{MC}}[\bar\phi]$ is the Monte Carlo standard error, and is a measure of how
well we expect $\bar\phi$ to approximate the integral $\int \phi\, p(\phi)\, d\phi$. If we were to rerun
the MC approximation procedure many times, perhaps with different starting
values or random number generators, we expect that $\phi_0$, the true value of the
integral, would be contained within the interval
$\bar\phi \pm 2\sqrt{\text{Var}_{\text{MC}}[\bar\phi]}$ for roughly 95% of the MC approximations. The width of
this interval is $4 \times \sqrt{\text{Var}_{\text{MC}}[\bar\phi]}$, and we can make this as small as we want by
generating more MC samples.
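
The Monte Carlo standard error and the "roughly 95%" statement can be illustrated with a small simulation. The target here is a hypothetical normal(2, 9) distribution, chosen only because its mean $\phi_0 = 2$ is known; the number of replications (2,000) is likewise arbitrary.

```r
set.seed(1)
S <- 1000
phi0 <- 2                               # true value of the integral (hypothetical target)

# One Monte Carlo approximation and its standard error
phi <- rnorm(S, phi0, 3)
phi.bar <- mean(phi)
mc.se <- sqrt(var(phi) / S)             # estimate of sqrt(Var[phi]/S)
interval <- phi.bar + c(-2, 2) * mc.se  # width is 4 x mc.se

# Rerun the procedure many times and record how often the interval covers phi0
covers <- replicate(2000, {
  x <- rnorm(S, phi0, 3)
  abs(mean(x) - phi0) <= 2 * sqrt(var(x) / S)
})
coverage <- mean(covers)

c(phi.bar = phi.bar, mc.se = mc.se, coverage = coverage)
```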

What if we use an MCMC algorithm such as the Gibbs sampler? As can
be seen in Figures 6.5 and 6.6, consecutive MCMC samples $\phi^{(s)}$ and $\phi^{(s+1)}$
can be positively correlated. Assuming stationarity has been achieved, the
expected squared difference from the MCMC integral approximation $\bar\phi$ to the
target $\phi_0 = \int \phi\, p(\phi)\, d\phi$ is the MCMC variance, and is given by

$$
\begin{aligned}
\text{Var}_{\text{MCMC}}[\bar\phi]
  &= E[(\bar\phi - \phi_0)^2] \\
  &= E\Big[\Big\{\frac{1}{S} \sum_{s=1}^{S} (\phi^{(s)} - \phi_0)\Big\}^2\Big] \\
  &= \frac{1}{S^2} E\Big[\sum_{s=1}^{S} (\phi^{(s)} - \phi_0)^2 + \sum_{s \ne t} (\phi^{(s)} - \phi_0)(\phi^{(t)} - \phi_0)\Big] \\
  &= \frac{1}{S^2} \sum_{s=1}^{S} E[(\phi^{(s)} - \phi_0)^2] + \frac{1}{S^2} \sum_{s \ne t} E[(\phi^{(s)} - \phi_0)(\phi^{(t)} - \phi_0)] \\
  &= \text{Var}_{\text{MC}}[\bar\phi] + \frac{1}{S^2} \sum_{s \ne t} E[(\phi^{(s)} - \phi_0)(\phi^{(t)} - \phi_0)].
\end{aligned}
$$

So the MCMC variance is equal to the MC variance plus a term that depends
on the correlation of samples within the Markov chain. This term is generally
positive and so the MCMC variance is higher than the MC variance, meaning
that we expect the MCMC approximation to be further away from $\phi_0$ than the
MC approximation is. The higher the autocorrelation in the chain, the larger
the MCMC variance and the worse the approximation is. To assess how much
correlation there is in the chain we often compute the sample autocorrelation
function. For a generic sequence of numbers $\{\phi_1, \ldots, \phi_S\}$, the lag-$t$
autocorrelation function estimates the correlation between elements of the sequence
that are $t$ steps apart:

$$
\text{acf}_t(\phi) =
  \frac{\frac{1}{S - t} \sum_{s=1}^{S - t} (\phi_s - \bar\phi)(\phi_{s+t} - \bar\phi)}
       {\frac{1}{S - 1} \sum_{s=1}^{S} (\phi_s - \bar\phi)^2},
$$


which is computed by the R-function acf. For the sequence of 10,000 $\theta$-values
plotted in Figure 6.6, the lag-10 autocorrelation is 0.93, and the lag-50
autocorrelation is 0.812. A Markov chain with such a high autocorrelation moves
around the parameter space slowly, taking a long time to achieve the correct
balance among the different regions of the parameter space. The higher the
autocorrelation, the more MCMC samples we need to attain a given level
of precision for our approximation. One way to measure this is to calculate
the effective sample size for an MCMC sequence, using the R-command
effectiveSize in the “coda” package. The effective sample size function
estimates the value $S_{\text{eff}}$ such that

$$
\text{Var}_{\text{MCMC}}[\bar\phi] = \frac{\text{Var}[\phi]}{S_{\text{eff}}},
$$

so that $S_{\text{eff}}$ can be interpreted as the number of independent Monte Carlo
samples necessary to give the same precision as the MCMC samples. For the
normal mixture density example above, the effective sample size of the 10,000
Gibbs samples of $\theta$ is 18.42, indicating that the precision of the MCMC
approximation to $E[\theta]$ is as good as the precision that would have been obtained
by only about 18 independent samples of $\theta$.
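
The sketch below computes the lag-$t$ autocorrelation directly from the formula above and compares it with the output of `acf`. The chain is a hypothetical first-order autoregressive sequence with coefficient `rho`, used only because its autocorrelations ($\rho^t$) and effective sample size ($S(1-\rho)/(1+\rho)$) are known; the function name `acf.t` is introduced here. Note that `acf` divides both sums by $S$, so the two versions differ by a factor close to 1 when $t$ is small relative to $S$. The call to `effectiveSize` is only attempted if the coda package happens to be installed.

```r
set.seed(1)
S <- 10000; rho <- 0.9

# Hypothetical autocorrelated chain: stationary AR(1) with unit variance
phi <- numeric(S)
phi[1] <- rnorm(1)
for (s in 2:S) phi[s] <- rho * phi[s - 1] + rnorm(1, 0, sqrt(1 - rho^2))

# Lag-t sample autocorrelation, following the formula in the text
acf.t <- function(phi, t) {
  S <- length(phi); m <- mean(phi)
  num <- sum((phi[1:(S - t)] - m) * (phi[(1 + t):S] - m)) / (S - t)
  den <- sum((phi - m)^2) / (S - 1)
  num / den
}

lag10   <- acf.t(phi, 10)
lag10.R <- acf(phi, lag.max = 10, plot = FALSE)$acf[11]   # element 1 is lag 0
c(formula = lag10, acf = lag10.R, theory = rho^10)

# Effective sample size: known value for an AR(1) chain, and coda's estimate
S.eff.ar1 <- S * (1 - rho) / (1 + rho)
S.eff.ar1
if (requireNamespace("coda", quietly = TRUE)) {
  print(coda::effectiveSize(phi))
}
```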

There is a large literature on the practical implementation and assessment 
of Gibbs sampling and MCMC approximation. Much insight can be gained 
by hands-on experience supplemented by reading books and articles. A good 
article to start with is “Practical Markov chain Monte Carlo” (Geyer, 1992), 
which includes a discussion by many researchers and a large variety of view- 
points on and techniques for MCMC approximation. 


MCMC diagnostics for the semiconjugate normal analysis 


We now assess the Markov chain of $\theta$ and $\sigma^2$ values generated by the Gibbs
sampler in Section 6.4. Figure 6.7 plots the values of these two parameters
in sequential order, and seems to indicate immediate convergence and
a low degree of autocorrelation. The lag-1 autocorrelation for the sequence
$\{\theta^{(1)}, \ldots, \theta^{(1000)}\}$ is 0.031, which is essentially zero for approximation
purposes. The effective sample size for this sequence is computed in R to be 1,000.
The lag-1 autocorrelation for the $\sigma^2$-values is 0.147, with an effective sample
size of 742. While not quite as good as an independently sampled sequence
of parameter values, the Gibbs sampler for this model and prior distribution
performs quite well.

Fig. 6.7. Traceplots for $\theta$ and $\sigma^2$.


6.7 Discussion and further references 


The term “Gibbs sampling” was coined by Geman and Geman (1984) in their 
paper on image analysis, but the algorithm appears earlier in the context 
of spatial statistics, for example, Besag (1974) or Ripley (1979). However, 
the general utility of the Gibbs sampler for Bayesian data analysis was not 
fully realized until the late 1980s (Gelfand and Smith, 1990). See Robert and 
Casella (2008) for a historical review. 

Assessing the convergence of the Gibbs sampler and the accuracy of the 
MCMC approximation is difficult. Several authors have come up with con- 
vergence diagnostics (Gelman and Rubin, 1992; Geweke, 1992; Raftery and 
Lewis, 1992), although these can only highlight problems and not guarantee 
a good approximation (Geyer, 1992). 


7


The multivariate normal model 


Up until now all of our statistical models have been univariate models, that 
is, models for a single measurement on each member of a sample of individuals 
or each run of a repeated experiment. However, datasets are frequently multi- 
variate, having multiple measurements for each individual or experiment. This 
chapter covers what is perhaps the most useful model for multivariate data, 
the multivariate normal model, which allows us to jointly estimate population 
means, variances and correlations of a collection of variables. After first cal- 
culating posterior distributions under semiconjugate prior distributions, we 
show how the multivariate normal model can be used to impute data that are 
missing at random. 


7.1 The multivariate normal density 
Example: Reading comprehension 


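A sample of twenty-two children are given reading comprehension tests before
and after receiving a particular instructional method. Each student $i$ will then
have two scores, $Y_{i,1}$ and $Y_{i,2}$, denoting the pre- and post-instructional scores
respectively. We denote each student’s pair of scores as a $2 \times 1$ vector
$\boldsymbol{Y}_i$, so that

$$
\boldsymbol{Y}_i = \begin{pmatrix} Y_{i,1} \\ Y_{i,2} \end{pmatrix}
  = \begin{pmatrix} \text{score on first test} \\ \text{score on second test} \end{pmatrix}.
$$

Things we might be interested in include the population mean $\boldsymbol\theta$,

$$
E[\boldsymbol{Y}] = \begin{pmatrix} E[Y_{i,1}] \\ E[Y_{i,2}] \end{pmatrix}
  = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}
$$

and the population covariance matrix $\Sigma$,

$$
\Sigma = \text{Cov}[\boldsymbol{Y}] =
  \begin{pmatrix}
    E[Y_1^2] - E[Y_1]^2 & E[Y_1 Y_2] - E[Y_1]E[Y_2] \\
    E[Y_1 Y_2] - E[Y_1]E[Y_2] & E[Y_2^2] - E[Y_2]^2
  \end{pmatrix}
  = \begin{pmatrix} \sigma_1^2 & \sigma_{1,2} \\ \sigma_{1,2} & \sigma_2^2 \end{pmatrix},
$$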

P.D. Hoff, A First Course in Bayesian Statistical Methods, 

Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_7, 

© Springer Science+Business Media, LLC 2009 
where the expectations above represent the unknown population averages.
Having information about $\boldsymbol\theta$ and $\Sigma$ may help us in assessing the effectiveness
of the teaching method, possibly evaluated with $\theta_2 - \theta_1$, or the consistency of
the reading comprehension test, which could be evaluated with the correlation
coefficient $\rho_{1,2} = \sigma_{1,2} / \sqrt{\sigma_1^2 \sigma_2^2}$.


The multivariate normal density 


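Notice that $\boldsymbol\theta$ and $\Sigma$ are both functions of population moments, or population
averages of powers of $Y_1$ and $Y_2$. In particular, $\boldsymbol\theta$ and $\Sigma$ are functions of
first- and second-order moments:


first-order moments: $E[Y_1]$, $E[Y_2]$
second-order moments: $E[Y_1^2]$, $E[Y_1 Y_2]$, $E[Y_2^2]$


Recall from Chapter 5 that a univariate normal model describes a population
in terms of its mean and variance $(\theta, \sigma^2)$, or equivalently its first two moments
$(E[Y] = \theta, E[Y^2] = \sigma^2 + \theta^2)$. The analogous model for describing first- and
second-order moments of multivariate data is the multivariate normal model.
We say a $p$-dimensional data vector $\boldsymbol{Y}$ has a multivariate normal distribution
if its sampling density is given by

$$
p(\boldsymbol{y} \mid \boldsymbol\theta, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2}
  \exp\{-(\boldsymbol{y} - \boldsymbol\theta)^T \Sigma^{-1} (\boldsymbol{y} - \boldsymbol\theta)/2\}
$$

where

$$
\boldsymbol{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}, \qquad
\boldsymbol\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix}
  \sigma_1^2 & \sigma_{1,2} & \cdots & \sigma_{1,p} \\
  \sigma_{1,2} & \sigma_2^2 & \cdots & \sigma_{2,p} \\
  \vdots & \vdots & & \vdots \\
  \sigma_{1,p} & \cdots & \cdots & \sigma_p^2
\end{pmatrix}.
$$

Calculating this density requires a few operations involving matrix algebra.
For a matrix $A$, the value of $|A|$ is called the determinant of $A$, and measures
how “big” $A$ is. The inverse of $A$ is the matrix $A^{-1}$ such that $AA^{-1}$ is equal
to the identity matrix $I_p$, the $p \times p$ matrix that has ones for its diagonal entries
but is otherwise zero. For a $p \times 1$ vector $\boldsymbol{b}$, $\boldsymbol{b}^T$ is its transpose, and is simply
the $1 \times p$ vector of the same values. Finally, the vector-matrix product
$\boldsymbol{b}^T A$ is equal to the $1 \times p$ vector
$(\sum_{j=1}^{p} b_j a_{j,1}, \ldots, \sum_{j=1}^{p} b_j a_{j,p})$, and the value of $\boldsymbol{b}^T A \boldsymbol{b}$ is the
single number $\sum_{j=1}^{p} \sum_{k=1}^{p} b_j b_k a_{j,k}$. Fortunately, R can compute
all of these quantities for us, as we shall see in the forthcoming example code.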
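
As a preview, the matrix operations just listed map onto base R as follows. The function name `dmvn` and the observation `y` are hypothetical and introduced only for this sketch; the mean vector and covariance matrix are those of the right-hand density in Figure 7.1.

```r
theta <- c(50, 50)
Sigma <- matrix(c(64, 48, 48, 144), nrow = 2, ncol = 2)
y <- c(55, 40)                       # hypothetical observation

det(Sigma)                           # determinant |Sigma|
Sigma.inv <- solve(Sigma)            # inverse of Sigma
Sigma %*% Sigma.inv                  # identity matrix, up to rounding error
t(y - theta) %*% Sigma.inv           # 1 x p vector-matrix product

# Multivariate normal density, written out from the formula in the text
dmvn <- function(y, theta, Sigma) {
  p <- length(y)
  quad <- drop(t(y - theta) %*% solve(Sigma) %*% (y - theta))  # single number
  (2 * pi)^(-p / 2) * det(Sigma)^(-1 / 2) * exp(-quad / 2)
}

dmvn(y, theta, Sigma)
```

When `Sigma` is diagonal the density factors into a product of univariate normal densities, which gives a convenient check on the function.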

Figure 7.1 gives contour plots and 30 samples from each of three different
two-dimensional multivariate normal densities. In each one $\boldsymbol\theta = (50, 50)^T$,
$\sigma_1^2 = 64$, $\sigma_2^2 = 144$, but the value of $\sigma_{1,2}$ varies from plot to plot, with
$\sigma_{1,2} = -48$ for the left density, 0 for the middle and $+48$ for the density on the
right (giving correlations of $-.5$, 0 and $+.5$ respectively). An interesting feature of
the multivariate normal distribution is that the marginal distribution of each
variable $Y_j$ is a univariate normal distribution, with mean $\theta_j$ and variance $\sigma_j^2$.
This means that the marginal distributions for $Y_1$ from the three populations
in Figure 7.1 are identical (the same holds for $Y_2$). The only thing that differs
across the three populations is the relationship between $Y_1$ and $Y_2$, which is
controlled by the covariance parameter $\sigma_{1,2}$.

Fig. 7.1. Multivariate normal samples and densities.


7.2 A semiconjugate prior distribution for the mean 


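Recall from Chapters 5 and 6 that if $Y_1, \ldots, Y_n$ are independent samples from
a univariate normal population, then a convenient conjugate prior distribution
for the population mean is also univariate normal. Similarly, a convenient prior
distribution for the multivariate mean $\boldsymbol\theta$ is a multivariate normal distribution,
which we will parameterize as

$$
p(\boldsymbol\theta) = \text{multivariate normal}(\boldsymbol\mu_0, \Lambda_0),
$$

where $\boldsymbol\mu_0$ and $\Lambda_0$ are the prior mean and variance of $\boldsymbol\theta$, respectively. What is
the full conditional distribution of $\boldsymbol\theta$, given $\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n$ and $\Sigma$? In the univariate
case, having normal prior and sampling distributions resulted in a normal full
conditional distribution for the population mean. Let’s see if this result holds
for the multivariate case. We begin by examining the prior distribution as a
function of $\boldsymbol\theta$:

$$
\begin{aligned}
p(\boldsymbol\theta)
  &= (2\pi)^{-p/2} |\Lambda_0|^{-1/2} \exp\{-\tfrac{1}{2} (\boldsymbol\theta - \boldsymbol\mu_0)^T \Lambda_0^{-1} (\boldsymbol\theta - \boldsymbol\mu_0)\} \\
  &= (2\pi)^{-p/2} |\Lambda_0|^{-1/2} \exp\{-\tfrac{1}{2} \boldsymbol\theta^T \Lambda_0^{-1} \boldsymbol\theta
       + \boldsymbol\theta^T \Lambda_0^{-1} \boldsymbol\mu_0 - \tfrac{1}{2} \boldsymbol\mu_0^T \Lambda_0^{-1} \boldsymbol\mu_0\} \\
  &\propto \exp\{-\tfrac{1}{2} \boldsymbol\theta^T \Lambda_0^{-1} \boldsymbol\theta + \boldsymbol\theta^T \Lambda_0^{-1} \boldsymbol\mu_0\} \\
  &= \exp\{-\tfrac{1}{2} \boldsymbol\theta^T A_0 \boldsymbol\theta + \boldsymbol\theta^T \boldsymbol{b}_0\},
\end{aligned}
\tag{7.1}
$$

where $A_0 = \Lambda_0^{-1}$ and $\boldsymbol{b}_0 = \Lambda_0^{-1} \boldsymbol\mu_0$. Conversely, Equation 7.1 says that if a
random vector $\boldsymbol\theta$ has a density on $\mathbb{R}^p$ that is proportional to
$\exp\{-\boldsymbol\theta^T A \boldsymbol\theta / 2 + \boldsymbol\theta^T \boldsymbol{b}\}$ for some matrix $A$ and vector $\boldsymbol{b}$, then $\boldsymbol\theta$ must
have a multivariate normal distribution with covariance $A^{-1}$ and mean $A^{-1}\boldsymbol{b}$.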

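If our sampling model is that $\{\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n \mid \boldsymbol\theta, \Sigma\}$ are i.i.d. multivariate
normal$(\boldsymbol\theta, \Sigma)$, then similar calculations show that the joint sampling density
of the observed vectors $\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n$ is

$$
\begin{aligned}
p(\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n \mid \boldsymbol\theta, \Sigma)
  &= \prod_{i=1}^{n} (2\pi)^{-p/2} |\Sigma|^{-1/2}
     \exp\{-(\boldsymbol{y}_i - \boldsymbol\theta)^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol\theta)/2\} \\
  &= (2\pi)^{-np/2} |\Sigma|^{-n/2}
     \exp\Big\{-\tfrac{1}{2} \sum_{i=1}^{n} (\boldsymbol{y}_i - \boldsymbol\theta)^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol\theta)\Big\} \\
  &\propto \exp\{-\tfrac{1}{2} \boldsymbol\theta^T A_1 \boldsymbol\theta + \boldsymbol\theta^T \boldsymbol{b}_1\},
\end{aligned}
\tag{7.2}
$$

where $A_1 = n\Sigma^{-1}$, $\boldsymbol{b}_1 = n\Sigma^{-1}\bar{\boldsymbol{y}}$ and $\bar{\boldsymbol{y}}$ is the vector of variable-specific
averages $\bar{\boldsymbol{y}} = (\frac{1}{n} \sum_{i=1}^{n} y_{i,1}, \ldots, \frac{1}{n} \sum_{i=1}^{n} y_{i,p})^T$. Combining
Equations 7.1 and 7.2 gives

$$
\begin{aligned}
p(\boldsymbol\theta \mid \boldsymbol{y}_1, \ldots, \boldsymbol{y}_n, \Sigma)
  &\propto \exp\{-\tfrac{1}{2} \boldsymbol\theta^T A_0 \boldsymbol\theta + \boldsymbol\theta^T \boldsymbol{b}_0\} \times
           \exp\{-\tfrac{1}{2} \boldsymbol\theta^T A_1 \boldsymbol\theta + \boldsymbol\theta^T \boldsymbol{b}_1\} \\
  &= \exp\{-\tfrac{1}{2} \boldsymbol\theta^T A_n \boldsymbol\theta + \boldsymbol\theta^T \boldsymbol{b}_n\}, \quad \text{where}
\end{aligned}
\tag{7.3}
$$

$$
\begin{aligned}
A_n &= A_0 + A_1 = \Lambda_0^{-1} + n\Sigma^{-1} \quad \text{and} \\
\boldsymbol{b}_n &= \boldsymbol{b}_0 + \boldsymbol{b}_1 = \Lambda_0^{-1} \boldsymbol\mu_0 + n\Sigma^{-1} \bar{\boldsymbol{y}}.
\end{aligned}
$$

From the comments in the previous paragraph, Equation 7.3 implies that
the conditional distribution of $\boldsymbol\theta$ therefore must be a multivariate normal
distribution with covariance $A_n^{-1}$ and mean $A_n^{-1}\boldsymbol{b}_n$, so

$$
\text{Cov}[\boldsymbol\theta \mid \boldsymbol{y}_1, \ldots, \boldsymbol{y}_n, \Sigma] = \Lambda_n = (\Lambda_0^{-1} + n\Sigma^{-1})^{-1} \tag{7.4}
$$

$$
E[\boldsymbol\theta \mid \boldsymbol{y}_1, \ldots, \boldsymbol{y}_n, \Sigma] = \boldsymbol\mu_n
  = (\Lambda_0^{-1} + n\Sigma^{-1})^{-1} (\Lambda_0^{-1} \boldsymbol\mu_0 + n\Sigma^{-1} \bar{\boldsymbol{y}}) \tag{7.5}
$$

$$
p(\boldsymbol\theta \mid \boldsymbol{y}_1, \ldots, \boldsymbol{y}_n, \Sigma) = \text{multivariate normal}(\boldsymbol\mu_n, \Lambda_n). \tag{7.6}
$$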


It looks a bit complicated, but can be made more understandable by analogy 
with the univariate normal case: Equation 7.4 says that posterior precision, or 
inverse variance, is the sum of the prior precision and the data precision, just as 
in the univariate normal case. Similarly, Equation 7.5 says that the posterior 
expectation is a weighted average of the prior expectation and the sample 
mean. Notice that, since the sample mean is consistent for the population 
mean, the posterior mean also will be consistent for the population mean 
even if the true distribution of the data is not multivariate normal.


7.3 The inverse-Wishart distribution


Just as a variance $\sigma^2$ must be positive, a variance-covariance matrix $\Sigma$ must
be positive definite, meaning that

$$x^T\Sigma x > 0 \text{ for all nonzero vectors } x.$$


Positive definiteness guarantees that $\sigma_j^2 > 0$ for all $j$ and that all correlations
are between -1 and 1. Another requirement of our covariance matrix is that it
is symmetric, which means that $\sigma_{j,k} = \sigma_{k,j}$. Any valid prior distribution for
$\Sigma$ must put all of its probability mass on this complicated set of symmetric,
positive definite matrices. How can we formulate such a prior distribution?


Empirical covariance matrices 


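The sum of squares matrix of a collection of multivariate vectors $z_1,\ldots,z_n$
is given by

$$\sum_{i=1}^n z_i z_i^T = Z^T Z,$$

where $Z$ is the $n \times p$ matrix whose $i$th row is $z_i^T$. Recall from matrix algebra
that since $z_i$ can be thought of as a $p \times 1$ matrix, $z_i z_i^T$ is the following $p \times p$
matrix:

$$
z_i z_i^T = \begin{pmatrix}
z_{i,1}^2 & z_{i,1}z_{i,2} & \cdots & z_{i,1}z_{i,p} \\
z_{i,2}z_{i,1} & z_{i,2}^2 & \cdots & z_{i,2}z_{i,p} \\
\vdots & & & \vdots \\
z_{i,p}z_{i,1} & z_{i,p}z_{i,2} & \cdots & z_{i,p}^2
\end{pmatrix}.
$$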


If the $z_i$’s are samples from a population with zero mean, we can think of
the matrix $z_i z_i^T/n$ as the contribution of vector $z_i$ to the estimate of the
covariance matrix of all of the observations. In this mean-zero case, if we
divide $Z^T Z$ by $n$, we get a sample covariance matrix, an unbiased estimator
of the population covariance matrix:

$$
\begin{aligned}
\tfrac{1}{n}[Z^T Z]_{j,j} &= \tfrac{1}{n}\sum_{i=1}^n z_{i,j}^2 = s_{j,j} = s_j^2 \\
\tfrac{1}{n}[Z^T Z]_{j,k} &= \tfrac{1}{n}\sum_{i=1}^n z_{i,j}z_{i,k} = s_{j,k}.
\end{aligned}
$$


If $n > p$ and the $z_i$’s are linearly independent, then $Z^T Z$ will be positive
definite and symmetric. This suggests the following construction of a “random”
covariance matrix: For a given positive integer $\nu_0$ and a $p \times p$ covariance
matrix $\Phi_0$,


1. sample $z_1,\ldots,z_{\nu_0} \sim$ i.i.d. multivariate normal$(0, \Phi_0)$;
2. calculate $Z^T Z = \sum_{i=1}^{\nu_0} z_i z_i^T$.


We can repeat this procedure over and over again, generating matrices
$Z_1^T Z_1,\ldots,Z_S^T Z_S$. The population distribution of these sum of squares
matrices is called a Wishart distribution with parameters $(\nu_0, \Phi_0)$, which has the
following properties:

• If $\nu_0 > p$, then $Z^T Z$ is positive definite with probability 1.
• $Z^T Z$ is symmetric with probability 1.
• $\mathrm{E}[Z^T Z] = \nu_0\Phi_0$.
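
The two-step construction and the expectation property can be checked by simulation. The following is a minimal base-R sketch; the helper name `rwish_one` and the values of `nu0` and `Phi0` are assumptions chosen for illustration.

```r
set.seed(1)
p    <- 2
nu0  <- 5                                   # assumed degrees of freedom
Phi0 <- matrix(c(2, 0.5, 0.5, 1), p, p)     # assumed covariance matrix

# One draw: sum of squares of nu0 i.i.d. multivariate normal(0, Phi0) rows
rwish_one <- function(nu0, Phi0) {
  p <- nrow(Phi0)
  Z <- matrix(rnorm(nu0 * p), nu0, p) %*% chol(Phi0)  # rows ~ N(0, Phi0)
  t(Z) %*% Z                                          # Z'Z
}

draws <- replicate(20000, rwish_one(nu0, Phi0))       # p x p x 20000 array
W_bar <- apply(draws, c(1, 2), mean)                  # Monte Carlo E[Z'Z]
print(W_bar)
print(nu0 * Phi0)                                     # theoretical mean
```

Multiplying a matrix of independent standard normals by `chol(Phi0)` gives rows with covariance `Phi0`, because `chol` returns the upper-triangular factor `R` with `t(R) %*% R` equal to `Phi0`.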


The Wishart distribution is a multivariate analogue of the gamma distribution
(recall that if $z$ is a mean-zero univariate normal random variable, then $z^2$ is a
gamma random variable). In the univariate normal model, our prior distribution
for the precision $1/\sigma^2$ is a gamma distribution, and our full conditional
distribution for the variance is an inverse-gamma distribution. Similarly, it
turns out that the Wishart distribution is a semi-conjugate prior distribution
for the precision matrix $\Sigma^{-1}$, and so the inverse-Wishart distribution is our
semi-conjugate prior distribution for the covariance matrix $\Sigma$. With a slight
reparameterization, to sample a covariance matrix $\Sigma$ from an inverse-Wishart
distribution we perform the following steps:


1. sample $z_1,\ldots,z_{\nu_0} \sim$ i.i.d. multivariate normal$(0, S_0^{-1})$;
2. calculate $Z^T Z = \sum_{i=1}^{\nu_0} z_i z_i^T$;
3. set $\Sigma = (Z^T Z)^{-1}$.

Under this simulation scheme, the precision matrix $\Sigma^{-1}$ has a Wishart$(\nu_0, S_0^{-1})$
distribution and the covariance matrix $\Sigma$ has an inverse-Wishart$(\nu_0, S_0^{-1})$
distribution. The expectations of $\Sigma^{-1}$ and $\Sigma$ are

$$
\begin{aligned}
\mathrm{E}[\Sigma^{-1}] &= \nu_0 S_0^{-1} \\
\mathrm{E}[\Sigma] &= \frac{1}{\nu_0 - p - 1} S_0.
\end{aligned}
$$

If we are confident that the true covariance matrix is near some covariance
matrix $\Sigma_0$, then we might choose $\nu_0$ to be large and set $S_0 = (\nu_0 - p - 1)\Sigma_0$,
making the distribution of $\Sigma$ concentrated around $\Sigma_0$. On the other hand,
choosing $\nu_0 = p + 2$ and $S_0 = \Sigma_0$ makes $\Sigma$ only loosely centered around $\Sigma_0$.
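
The three sampling steps and the expectation of $\Sigma$ can be illustrated with a short simulation. This is a sketch under assumed values: the helper name `rinvwish_one`, the prior guess `Sigma0` and the choice `nu0 = 10` are hypothetical.

```r
set.seed(1)
p      <- 2
nu0    <- 10                                # assumed prior "sample size"
Sigma0 <- matrix(c(4, 1, 1, 3), p, p)       # assumed prior guess at Sigma
S0     <- (nu0 - p - 1) * Sigma0            # centers E[Sigma] at Sigma0

# Steps 1-3: build a Wishart(nu0, S0^{-1}) draw, then invert it
rinvwish_one <- function(nu0, S0) {
  p <- nrow(S0)
  Z <- matrix(rnorm(nu0 * p), nu0, p) %*% chol(solve(S0))  # rows ~ N(0, S0^{-1})
  solve(t(Z) %*% Z)                                        # Sigma = (Z'Z)^{-1}
}

draws     <- replicate(20000, rinvwish_one(nu0, S0))
Sigma_bar <- apply(draws, c(1, 2), mean)
print(Sigma_bar)             # Monte Carlo estimate of E[Sigma]
print(S0 / (nu0 - p - 1))    # theoretical value, equal to Sigma0
```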


Full conditional distribution of the covariance matrix


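The inverse-Wishart$(\nu_0, S_0^{-1})$ density is given by

$$
\begin{aligned}
p(\Sigma) = {} & \Big[ 2^{\nu_0 p/2}\,\pi^{\binom{p}{2}/2}\,|S_0|^{-\nu_0/2} \prod_{j=1}^p \Gamma([\nu_0 + 1 - j]/2) \Big]^{-1} \times \\
& |\Sigma|^{-(\nu_0+p+1)/2} \times \exp\{-\mathrm{tr}(S_0\Sigma^{-1})/2\}. \qquad (7.7)
\end{aligned}
$$

The normalizing constant is quite intimidating. Fortunately we will only have
to work with the second line of the equation. The expression “tr” stands for
trace and for a square $p \times p$ matrix $A$, $\mathrm{tr}(A) = \sum_{j=1}^p a_{j,j}$, the sum of the
diagonal elements.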

We now need to combine the above prior distribution with the sampling
distribution for $Y_1,\ldots,Y_n$:

$$
p(y_1,\ldots,y_n|\theta,\Sigma) = (2\pi)^{-np/2}|\Sigma|^{-n/2}\exp\{-\sum_{i=1}^n (y_i-\theta)^T\Sigma^{-1}(y_i-\theta)/2\}. \qquad (7.8)
$$

An interesting result from matrix algebra is that the sum $\sum_{k=1}^K b_k^T A b_k =
\mathrm{tr}(B^T B A)$, where $B$ is the matrix whose $k$th row is $b_k^T$. This means that the
term in the exponent of Equation 7.8 can be expressed as

$$
\begin{aligned}
\sum_{i=1}^n (y_i-\theta)^T\Sigma^{-1}(y_i-\theta) &= \mathrm{tr}(S_\theta\Sigma^{-1}), \text{ where} \\
S_\theta &= \sum_{i=1}^n (y_i-\theta)(y_i-\theta)^T.
\end{aligned}
$$

The matrix $S_\theta$ is the residual sum of squares matrix for the vectors $y_1,\ldots,y_n$
if the population mean is presumed to be $\theta$. Conditional on $\theta$, $\tfrac{1}{n}S_\theta$ provides
an unbiased estimate of the true covariance matrix $\mathrm{Cov}[Y]$ (more generally,
when $\theta$ is not conditioned on, the sample covariance matrix is
$\sum_{i=1}^n (y_i-\bar{y})(y_i-\bar{y})^T/(n-1)$ and is an unbiased estimate of $\Sigma$). Using the above result to
combine Equations 7.7 and 7.8 gives the conditional distribution of $\Sigma$:


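$$
\begin{aligned}
p(\Sigma|y_1,\ldots,y_n,\theta) &\propto p(\Sigma) \times p(y_1,\ldots,y_n|\theta,\Sigma) \\
&\propto \big(|\Sigma|^{-(\nu_0+p+1)/2}\exp\{-\mathrm{tr}(S_0\Sigma^{-1})/2\}\big) \times \big(|\Sigma|^{-n/2}\exp\{-\mathrm{tr}(S_\theta\Sigma^{-1})/2\}\big) \\
&= |\Sigma|^{-(\nu_0+n+p+1)/2}\exp\{-\mathrm{tr}([S_0+S_\theta]\Sigma^{-1})/2\}.
\end{aligned}
$$

Thus we have

$$\{\Sigma|y_1,\ldots,y_n,\theta\} \sim \text{inverse-Wishart}(\nu_0+n, [S_0+S_\theta]^{-1}). \qquad (7.9)$$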
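
The trace identity behind this derivation, and the parameters of the full conditional in Equation 7.9, can be checked numerically. The sketch below uses simulated data and assumed values of $\theta$, $\Sigma$, $\nu_0$ and $S_0$; none of these numbers come from the text.

```r
set.seed(1)
n <- 6; p <- 3
Y     <- matrix(rnorm(n * p), n, p)         # simulated data (illustrative)
theta <- c(0.2, -0.1, 0)                    # assumed current value of theta
Sigma <- diag(p) + 0.3                      # assumed positive definite matrix
iS    <- solve(Sigma)

# Residual sum of squares matrix S_theta = sum_i (y_i - theta)(y_i - theta)'
R       <- sweep(Y, 2, theta)               # row i is (y_i - theta)'
S_theta <- t(R) %*% R

# Left side: sum of quadratic forms; right side: tr(S_theta Sigma^{-1})
lhs <- sum(apply(R, 1, function(r) t(r) %*% iS %*% r))
rhs <- sum(diag(S_theta %*% iS))
print(c(lhs, rhs))                          # the two agree up to rounding

# Parameters of the full conditional in Equation 7.9
nu0 <- p + 2; S0 <- diag(p)                 # assumed prior parameters
nun <- nu0 + n
Sn  <- S0 + S_theta
print(Sn / (nun - p - 1))                   # E[Sigma | y_1,...,y_n, theta]
```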


Hopefully this result seems somewhat intuitive: We can think of $\nu_0 + n$ as the
“posterior sample size,” being the sum of the “prior sample size” $\nu_0$ and the
data sample size. Similarly, $S_0 + S_\theta$ can be thought of as the “prior” residual
sum of squares plus the residual sum of squares from the data. Additionally,
the conditional expectation of the population covariance matrix is

$$
\begin{aligned}
\mathrm{E}[\Sigma|y_1,\ldots,y_n,\theta] &= \frac{1}{\nu_0+n-p-1}(S_0+S_\theta) \\
&= \frac{\nu_0-p-1}{\nu_0+n-p-1}\cdot\frac{1}{\nu_0-p-1}S_0 + \frac{n}{\nu_0+n-p-1}\cdot\frac{1}{n}S_\theta
\end{aligned}
$$

and so the conditional expectation can be seen as a weighted average of the
prior expectation and the unbiased estimator. Because it can be shown that $\tfrac{1}{n}S_\theta$
converges to the true population covariance matrix, the posterior expectation
of $\Sigma$ is a consistent estimator of the population covariance, even if the true
population distribution is not multivariate normal.


7.4 Gibbs sampling of the mean and covariance


In the last two sections we showed that

$$
\begin{aligned}
\{\theta|y_1,\ldots,y_n,\Sigma\} &\sim \text{multivariate normal}(\mu_n, \Lambda_n) \\
\{\Sigma|y_1,\ldots,y_n,\theta\} &\sim \text{inverse-Wishart}(\nu_n, S_n^{-1}),
\end{aligned}
$$

where $\{\Lambda_n, \mu_n\}$ are defined in Equations 7.4 and 7.5, $\nu_n = \nu_0 + n$ and $S_n =
S_0 + S_\theta$. These full conditional distributions can be used to construct a Gibbs
sampler, providing us with an MCMC approximation to the joint posterior
distribution $p(\theta,\Sigma|y_1,\ldots,y_n)$. Given a starting value $\Sigma^{(0)}$, the Gibbs sampler
generates $\{\theta^{(s+1)}, \Sigma^{(s+1)}\}$ from $\{\theta^{(s)}, \Sigma^{(s)}\}$ via the following two steps:


1. Sample $\theta^{(s+1)}$ from its full conditional distribution:
   a) compute $\mu_n$ and $\Lambda_n$ from $y_1,\ldots,y_n$ and $\Sigma^{(s)}$;
   b) sample $\theta^{(s+1)} \sim$ multivariate normal$(\mu_n, \Lambda_n)$.
2. Sample $\Sigma^{(s+1)}$ from its full conditional distribution:
   a) compute $S_n$ from $y_1,\ldots,y_n$ and $\theta^{(s+1)}$;
   b) sample $\Sigma^{(s+1)} \sim$ inverse-Wishart$(\nu_0 + n, S_n^{-1})$.


Steps 1.a and 2.a highlight the fact that $\{\mu_n, \Lambda_n\}$ depend on the value of $\Sigma$,
and that $S_n$ depends on the value of $\theta$, and so these quantities need to be
recalculated at every iteration of the sampler.


Example: Reading comprehension 


Let’s return to the example from the beginning of the chapter in which each
of 22 children was given two reading comprehension exams, one before a
certain type of instruction and one after. We’ll model these 22 pairs of scores as
i.i.d. samples from a multivariate normal distribution. The exam was designed
to give average scores of around 50 out of 100, so $\mu_0 = (50, 50)^T$ would
be a good choice for our prior expectation. Since the true mean cannot be
below 0 or above 100, it is desirable to use a prior variance for $\theta$ that puts
little probability outside of this range. We’ll take the prior variances on $\theta_1$
and $\theta_2$ to be $\lambda_{0,1}^2 = \lambda_{0,2}^2 = (50/2)^2 = 625$, so that the prior probability
$\Pr(\theta_j \notin [0, 100])$ is only 0.05. Finally, since the two exams are measuring
similar things, whatever the true values of $\theta_1$ and $\theta_2$ are it is probable that
they are close. We can reflect this with a prior correlation of 0.5, so that
$\lambda_{1,2} = 312.5$. As for the prior distribution on $\Sigma$, some of the same logic about
the range of exam scores applies. We’ll take $S_0$ to be the same as $\Lambda_0$, but only
loosely center $\Sigma$ around this value by taking $\nu_0 = p + 2 = 4$.
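
The arithmetic behind these prior choices is easy to verify. The following sketch (variable names `sd0`, `rho` and `p_out` are introduced here for illustration) rebuilds the prior covariance matrix from the standard deviation and correlation, and computes the prior probability that a single mean falls outside [0, 100], which comes out at roughly 0.05.

```r
mu0 <- c(50, 50)
sd0 <- 50 / 2                       # prior standard deviation, so variance 625
rho <- 0.5                          # prior correlation
L0  <- matrix(c(1, rho, rho, 1), 2, 2) * sd0^2

# Prior probability that a single theta_j falls outside [0, 100]
p_out <- pnorm(0, mu0[1], sd0) + pnorm(100, mu0[1], sd0, lower.tail = FALSE)
print(L0)      # diagonal 625, off-diagonal 312.5
print(p_out)   # two-sided tail area beyond two standard deviations
```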


mu0<-c(50,50)
L0<-matrix(c(625,312.5,312.5,625),nrow=2,ncol=2)

nu0<-4
S0<-matrix(c(625,312.5,312.5,625),nrow=2,ncol=2)


The observed values $y_1,\ldots,y_{22}$ are plotted as dots in the second panel
of Figure 7.2. The sample mean is $\bar{y} = (47.18, 53.86)^T$, the sample variances
are $s_1^2 = 182.16$ and $s_2^2 = 243.65$, and the sample correlation is $s_{1,2}/(s_1 s_2) =
0.70$. Let’s use the Gibbs sampler described above to combine this sample
information with our prior distributions to obtain estimates and confidence
intervals for the population parameters. We begin by setting $\Sigma^{(0)}$ equal to the
sample covariance matrix, and iterating from there. In the R-code below, Y is
the 22 × 2 data matrix of the observed values.
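
The sampler code in this section calls two helper functions, `rmvnorm` and `rwish`, whose definitions do not appear in this section. A minimal base-R sketch of what they might look like is given here, assuming that `rmvnorm(n, mu, Sigma)` returns an n × p matrix of multivariate normal draws and that `rwish(n, nu0, S0)` returns a Wishart(nu0, S0) draw built with the sum-of-squares construction of Section 7.3. These signatures are inferred from how the functions are called, not confirmed by the text.

```r
# Draw n vectors from multivariate normal(mu, Sigma); returns an n x p matrix.
rmvnorm <- function(n, mu, Sigma) {
  p <- length(mu)
  E <- matrix(rnorm(n * p), n, p)          # independent standard normals
  t(t(E %*% chol(Sigma)) + c(mu))          # rows have covariance Sigma, mean mu
}

# Draw n matrices from Wishart(nu0, S0) via the sum-of-squares construction.
# Returns a p x p matrix when n = 1, otherwise a p x p x n array.
rwish <- function(n, nu0, S0) {
  p   <- nrow(S0)
  sS0 <- chol(S0)
  S   <- array(dim = c(p, p, n))
  for (i in 1:n) {
    Z <- matrix(rnorm(nu0 * p), nu0, p) %*% sS0   # rows ~ N(0, S0)
    S[, , i] <- t(Z) %*% Z
  }
  S[, , 1:n]
}
```

Defining these names in the global environment masks any functions of the same name from attached packages, so they are best used in a fresh session.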


data(chapter7) ; Y<-Y.reading
n<-dim(Y)[1] ; ybar<-apply(Y,2,mean)
Sigma<-cov(Y) ; THETA<-SIGMA<-NULL    # start Sigma at the sample covariance

set.seed(1)
for(s in 1:5000)
{

  ###update theta (Equations 7.4 and 7.5)
  Ln<-solve( solve(L0) + n*solve(Sigma) )
  mun<-Ln%*%( solve(L0)%*%mu0 + n*solve(Sigma)%*%ybar )
  theta<-rmvnorm(1,mun,Ln)
  ###

  ###update Sigma (Equation 7.9)
  Sn<- S0 + ( t(Y)-c(theta) )%*%t( t(Y)-c(theta) )
  Sigma<-solve( rwish(1, nu0+n, solve(Sn)) )
  ###

  ### save results
  THETA<-rbind(THETA,theta) ; SIGMA<-rbind(SIGMA,c(Sigma))
  ###

}
The above code generates 5,000 values $\{\theta^{(1)},\Sigma^{(1)}\},\ldots,\{\theta^{(5000)},\Sigma^{(5000)}\}$
whose empirical distribution approximates $p(\theta,\Sigma|y_1,\ldots,y_n)$. It is left as an
exercise to assess the convergence and autocorrelation of this Markov chain.
From these samples we can approximate posterior probabilities and confidence
regions of interest.

> quantile( THETA[,2]-THETA[,1], prob=c(.025,.5,.975) )
     2.5%       50%     97.5%
 1.513573  6.668097 11.794824

> mean( THETA[,2]>THETA[,1])
[1] 0.9942

The posterior probability $\Pr(\theta_2 > \theta_1|y_1,\ldots,y_n) = 0.99$ indicates strong
evidence that, if we were to give exams and instruction to a large population
of children, then the average score on the second exam would be higher than
that on the first. This evidence is displayed graphically in the first panel of
Figure 7.2, which shows 97.5%, 75%, 50%, 25% and 2.5% highest posterior
density contours for the joint posterior distribution of $\theta = (\theta_1, \theta_2)^T$. A
highest posterior density contour is a two-dimensional analogue of a confidence
interval. The contours for the posterior distribution of $\theta$ are all mostly above
the 45-degree line $\theta_1 = \theta_2$.


Fig. 7.2. Reading comprehension data and posterior distributions 


Now let’s ask a slightly different question: what is the probability that
a randomly selected child will score higher on the second exam than on
the first? The answer to this question is a function of the posterior
predictive distribution of a new sample $(Y_1, Y_2)^T$, given the observed values.
The second panel of Figure 7.2 shows highest posterior density contours of
the posterior predictive distribution, which, while mostly being above the
line $y_2 = y_1$, still has substantial overlap with the region below this line,
and in fact $\Pr(Y_2 > Y_1|y_1,\ldots,y_n) = 0.71$. How should we evaluate the
effectiveness of the between-exam instruction? On one hand, the fact that
$\Pr(\theta_2 > \theta_1|y_1,\ldots,y_n) = 0.99$ seems to suggest that there is a “highly
significant difference” in exam scores before and after the instruction, yet
$\Pr(Y_2 > Y_1|y_1,\ldots,y_n) = 0.71$ says that almost a third of the students will
get a lower score on the second exam. The difference between these two
probabilities is that the first is measuring the evidence that $\theta_2$ is larger than $\theta_1$
without regard to whether or not the magnitude of the difference $\theta_2 - \theta_1$ is
large compared to the sampling variability of the data. Confusion over these
two different ways of comparing populations is common in the reporting of
results from experiments or surveys: studies with very large values of $n$ often
result in values of $\Pr(\theta_2 > \theta_1|y_1,\ldots,y_n)$ that are very close to 1 (or p-values
that are very close to zero), suggesting a “significant effect,” even though
such results say nothing about how large of an effect we expect to see for a
randomly sampled individual.
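
The contrast between the two probabilities can be reproduced from posterior draws. The sketch below assumes draws stored in the same layout the sampler uses (rows of `THETA` are draws of $\theta$, rows of `SIGMA` are `c(Sigma)`); the function name `pred_prob` and the stand-in draws are hypothetical and are not the reading data.

```r
# THETA: S x 2 matrix of posterior draws of theta
# SIGMA: S x 4 matrix whose rows are c(Sigma), as saved by the sampler
pred_prob <- function(THETA, SIGMA) {
  S <- nrow(THETA)
  Y.new <- matrix(NA, S, 2)
  for (s in 1:S) {
    Sig <- matrix(SIGMA[s, ], 2, 2)
    # one predictive draw: theta + L z, with L L' = Sigma and z standard normal
    Y.new[s, ] <- THETA[s, ] + c(t(chol(Sig)) %*% rnorm(2))
  }
  mean(Y.new[, 2] > Y.new[, 1])            # Monte Carlo Pr(Y2 > Y1 | data)
}

# Stand-in draws for illustration only (not the reading data)
set.seed(1)
S     <- 5000
THETA <- cbind(rnorm(S, 47, 3), rnorm(S, 54, 3))
SIGMA <- matrix(c(200, 150, 150, 250), S, 4, byrow = TRUE)
p_theta <- mean(THETA[, 2] > THETA[, 1])   # Pr(theta2 > theta1)
p_pred  <- pred_prob(THETA, SIGMA)         # Pr(Y2 > Y1), noticeably smaller
print(c(p_theta, p_pred))
```

The predictive probability is smaller because each predictive draw adds the sampling variability in $\Sigma$ on top of the posterior uncertainty about $\theta$.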
7.5 Missing data and imputation


Fig. 7.3. Physiological data on 200 women.


Figure 7.3 displays univariate histograms and bivariate scatterplots for
four variables taken from a dataset involving health-related measurements on
200 women of Pima Indian heritage living near Phoenix, Arizona (Smith et al,
1988). The four variables are glu (blood plasma glucose concentration), bp
(diastolic blood pressure), skin (skin fold thickness) and bmi (body mass
index). The first ten subjects in this dataset have the following entries:

   glu bp skin  bmi
1   86 68   28 30.2
2  195 70   33   NA
3   77 82   NA 35.8
4   NA 76   43 47.9
5  107 60   NA   NA
6   97 76   27   NA
7   NA 58   31 34.3
8  193 50   16 25.9
9  142 80   15   NA
10 128 78   NA 43.3


The NA’s stand for “not available,” and so some data for some individuals 
are “missing.” Missing data are fairly common in survey data: Sometimes 
people accidentally miss a page of a survey, sometimes a doctor forgets to 
write down a piece of medical data, sometimes the response is unreadable, 
and so on. Many surveys (such as the General Social Survey) have multiple 
versions with certain questions appearing in only a subset of the versions. As 
a result, all the subjects may have missing data. 

In such situations it is not immediately clear how to do parameter estimation.
The posterior distribution for $\theta$ and $\Sigma$ depends on $\prod_{i=1}^n p(y_i|\theta,\Sigma)$, but
$p(y_i|\theta,\Sigma)$ cannot be computed if components of $y_i$ are missing. What can
we do? Unfortunately, many software packages either throw away all subjects 
with incomplete data, or impute missing values with a population mean or 
some other fixed value, then proceed with the analysis. The first approach is 
bad because we are throwing away a potentially large amount of useful infor- 
mation. The second is statistically incorrect, as it says we are certain about 
the values of the missing data when in fact we have not observed them. 

Let’s carefully think about the information that is available from subjects
with missing data. Let $O_i = (O_{i,1},\ldots,O_{i,p})^T$ be a binary vector of zeros and
ones such that $O_{i,j} = 1$ implies that $Y_{i,j}$ is observed and not missing, whereas
$O_{i,j} = 0$ implies $Y_{i,j}$ is missing. Our observed information about subject $i$
is therefore $O_i = o_i$ and $Y_{i,j} = y_{i,j}$ for variables $j$ such that $o_{i,j} = 1$. For
now, we’ll assume that missing data are missing at random, meaning that $O_i$
and $Y_i$ are statistically independent and that the distribution of $O_i$ does not
depend on $\theta$ or $\Sigma$. In cases where the data are missing but not at random,
then sometimes inference can be made by modeling the relationship between
$O_i$, $Y_i$ and the parameters (see Chapter 21 of Gelman et al (2004)).

In the case where data are missing at random, the sampling probability
for the data from subject $i$ is

$$
\begin{aligned}
p(o_i, \{y_{i,j} : o_{i,j} = 1\}|\theta,\Sigma) &= p(o_i) \times p(\{y_{i,j} : o_{i,j} = 1\}|\theta,\Sigma) \\
&= p(o_i) \times \int \Big\{ p(y_{i,1},\ldots,y_{i,p}|\theta,\Sigma) \prod_{y_{i,j} : o_{i,j} = 0} dy_{i,j} \Big\}.
\end{aligned}
$$

In words, our sampling probability for data from subject $i$ is $p(o_i)$
multiplied by the marginal probability of the observed variables, after
integrating out the missing variables. To make this more concrete, suppose
$y_i = (y_{i,1}, \text{NA}, y_{i,3}, \text{NA})^T$, so $o_i = (1, 0, 1, 0)^T$. Then

$$
\begin{aligned}
p(o_i, y_{i,1}, y_{i,3}|\theta,\Sigma) &= p(o_i) \times p(y_{i,1}, y_{i,3}|\theta,\Sigma) \\
&= p(o_i) \times \int p(y_i|\theta,\Sigma)\, dy_2\, dy_4.
\end{aligned}
$$

So the correct thing to do when data are missing at random is to integrate over
the missing data to obtain the marginal probability of the observed data. In
this particular case of the multivariate normal model, this marginal probability
is easily obtained: $p(y_{i,1}, y_{i,3}|\theta,\Sigma)$ is simply a bivariate normal density with
mean $(\theta_1, \theta_3)^T$ and a covariance matrix made up of $(\sigma_1^2, \sigma_{1,3}, \sigma_3^2)$. But combining
marginal densities from subjects having different amounts of information
can be notationally awkward. Fortunately, our integration can alternatively
be done quite easily using Gibbs sampling.
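
For the multivariate normal model, integrating out the missing entries amounts to dropping the corresponding elements of $\theta$ and the corresponding rows and columns of $\Sigma$. A minimal base-R sketch of this marginal log density follows; the function names `ldmvnorm` and `ldmvnorm_obs` and the parameter values are assumptions made for illustration.

```r
# Log density of a multivariate normal(mu, Sigma) at y (base R only).
ldmvnorm <- function(y, mu, Sigma) {
  Sigma <- as.matrix(Sigma)
  r <- y - mu
  -0.5 * (length(y) * log(2 * pi) +
          c(determinant(Sigma, logarithm = TRUE)$modulus) +
          sum(r * solve(Sigma, r)))
}

# Marginal log density of the observed entries of y (NA marks missing).
# Integrating out the missing entries just drops their rows and columns.
ldmvnorm_obs <- function(y, theta, Sigma) {
  o <- !is.na(y)
  ldmvnorm(y[o], theta[o], Sigma[o, o, drop = FALSE])
}

theta <- c(120, 64, 26, 26)                  # assumed parameter values
Sigma <- diag(c(900, 100, 100, 40)) + 20     # assumed covariance matrix
y_i   <- c(86, NA, 28, NA)                   # o_i = (1, 0, 1, 0)
print(ldmvnorm_obs(y_i, theta, Sigma))       # bivariate density of (y_1, y_3)
```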


Gibbs sampling with missing data 


In Bayesian inference we use probability distributions to describe our
information about unknown quantities. What are the unknown quantities for our
multivariate normal model with missing data? The parameters $\theta$ and $\Sigma$ are
unknown as usual, but the missing data are also an unknown but key component
of our model. Treating them as such allows us to use Gibbs sampling to
make inference on $\theta$, $\Sigma$, as well as to make predictions for the missing values.

Let $Y$ be the $n \times p$ matrix of all the potential data, observed and
unobserved, and let $O$ be the $n \times p$ matrix in which $o_{i,j} = 1$ if $Y_{i,j}$ is observed and
$o_{i,j} = 0$ if $Y_{i,j}$ is missing. The matrix $Y$ can then be thought of as consisting
of two parts:


• $Y_{\text{obs}} = \{y_{i,j} : o_{i,j} = 1\}$, the data that we do observe, and
• $Y_{\text{miss}} = \{y_{i,j} : o_{i,j} = 0\}$, the data that we do not observe.


From our observed data we want to obtain $p(\theta, \Sigma, Y_{\text{miss}}|Y_{\text{obs}})$, the
posterior distribution of unknown and unobserved quantities. A Gibbs sampling
scheme for approximating this posterior distribution can be constructed by
simply adding one step to the Gibbs sampler presented in the previous
section: Given starting values $\{\Sigma^{(0)}, Y_{\text{miss}}^{(0)}\}$, we generate $\{\theta^{(s+1)}, \Sigma^{(s+1)}, Y_{\text{miss}}^{(s+1)}\}$
from $\{\theta^{(s)}, \Sigma^{(s)}, Y_{\text{miss}}^{(s)}\}$ by


1. sampling $\theta^{(s+1)}$ from $p(\theta|Y_{\text{obs}}, Y_{\text{miss}}^{(s)}, \Sigma^{(s)})$;
2. sampling $\Sigma^{(s+1)}$ from $p(\Sigma|Y_{\text{obs}}, Y_{\text{miss}}^{(s)}, \theta^{(s+1)})$;
3. sampling $Y_{\text{miss}}^{(s+1)}$ from $p(Y_{\text{miss}}|Y_{\text{obs}}, \theta^{(s+1)}, \Sigma^{(s+1)})$.


Note that in steps 1 and 2, the fixed value of $Y_{\text{obs}}$ combines with the current
value of $Y_{\text{miss}}^{(s)}$ to form a current version of a complete data matrix $Y^{(s)}$ having
no missing values. The $n$ rows of the matrix $Y^{(s)}$ can then be plugged into
formulae 7.6 and 7.9 to obtain the full conditional distributions of $\theta$ and $\Sigma$.
Step 3 is a bit more complicated:


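$$
\begin{aligned}
p(Y_{\text{miss}}|Y_{\text{obs}}, \theta, \Sigma) &\propto p(Y_{\text{miss}}, Y_{\text{obs}}|\theta, \Sigma) \\
&= \prod_{i=1}^n p(y_{i,\text{miss}}, y_{i,\text{obs}}|\theta, \Sigma) \\
&\propto \prod_{i=1}^n p(y_{i,\text{miss}}|y_{i,\text{obs}}, \theta, \Sigma),
\end{aligned}
$$

so for each $i$ we need to sample the missing elements of the data vector
conditional on the observed elements. This is made possible via the following result
about multivariate normal distributions: Let $y \sim$ multivariate normal$(\theta, \Sigma)$,
let $a$ be a subset of variable indices $\{1,\ldots,p\}$ and let $b$ be the complement of
$a$. For example, if $p = 4$ then perhaps $a = \{1, 2\}$ and $b = \{3, 4\}$. If you know
about inverses of partitioned matrices you can show that

$$
\begin{aligned}
\{y_{[b]}|y_{[a]}, \theta, \Sigma\} &\sim \text{multivariate normal}(\theta_{b|a}, \Sigma_{b|a}), \text{ where} \\
\theta_{b|a} &= \theta_{[b]} + \Sigma_{[b,a]}(\Sigma_{[a,a]})^{-1}(y_{[a]} - \theta_{[a]}) \qquad (7.10) \\
\Sigma_{b|a} &= \Sigma_{[b,b]} - \Sigma_{[b,a]}(\Sigma_{[a,a]})^{-1}\Sigma_{[a,b]}. \qquad (7.11)
\end{aligned}
$$

In the above formulae, $\theta_{[b]}$ refers to the elements of $\theta$ corresponding to the
indices in $b$, and $\Sigma_{[a,b]}$ refers to the matrix made up of the elements that are
in rows $a$ and columns $b$ of $\Sigma$.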

Let’s try to gain a little bit of intuition about what is going on in Equations
7.10 and 7.11. Suppose $y$ is a sample from our population of four variables glu,
bp, skin and bmi. If we have glu and bp data for someone ($a = \{1, 2\}$) but are
missing skin and bmi measurements ($b = \{3, 4\}$), then we would be interested
in the conditional distribution of these missing measurements $y_{[b]}$ given the
observed information $y_{[a]}$. Equation 7.10 says that the conditional mean of
skin and bmi starts off at the unconditional mean $\theta_{[b]}$, but then is modified
by $(y_{[a]} - \theta_{[a]})$. For example, if a person had higher than average values of glu
and bp, then $(y_{[a]} - \theta_{[a]})$ would be a $2 \times 1$ vector of positive numbers. For our
data the $2 \times 2$ matrix $\Sigma_{[b,a]}(\Sigma_{[a,a]})^{-1}$ has all positive entries, and so $\theta_{b|a} > \theta_{[b]}$.
This makes sense: If all four variables are positively correlated, then if we
observe higher than average values of glu and bp, we should also expect
higher than average values of skin and bmi. Also note that $\Sigma_{b|a}$ is equal to
the unconditional variance $\Sigma_{[b,b]}$ but with something subtracted off, suggesting
that the conditional variance is less than the unconditional variance. Again,
this makes sense: having information about some variables should decrease,
or at least not increase, our uncertainty about the others.
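
Equations 7.10 and 7.11 can be wrapped in a small helper to make this intuition concrete. The sketch below uses assumed values of $\theta$ and $\Sigma$ with all correlations equal to 0.3; the function name `cond_mvnorm` and every number are illustrative, not estimates from the Pima data.

```r
# Conditional mean and covariance of y[b] given y[a] (Equations 7.10, 7.11).
cond_mvnorm <- function(y_a, a, b, theta, Sigma) {
  iSa  <- solve(Sigma[a, a, drop = FALSE])
  beta <- Sigma[b, a, drop = FALSE] %*% iSa
  list(mean = c(theta[b] + beta %*% (y_a - theta[a])),
       cov  = Sigma[b, b, drop = FALSE] - beta %*% Sigma[a, b, drop = FALSE])
}

# Assumed values for (glu, bp, skin, bmi), for illustration only
theta <- c(120, 64, 26, 26)
sds   <- c(30, 12, 10, 6)
Sigma <- (matrix(0.3, 4, 4) + diag(0.7, 4)) * outer(sds, sds)

# glu and bp observed, both above their means
out <- cond_mvnorm(y_a = c(150, 80), a = 1:2, b = 3:4, theta, Sigma)
print(out$mean)   # conditional means of skin and bmi, above theta[3:4]
print(out$cov)    # conditional covariance, with smaller diagonal than Sigma[3:4, 3:4]
```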

The R code below implements the Gibbs sampling scheme for missing data
described in steps 1, 2 and 3 above:


data(chapter7) ; Y<-Y.pima.miss

### prior parameters
n<-dim(Y)[1] ; p<-dim(Y)[2]
mu0<-c(120,64,26,26)
sd0<-(mu0/2)
L0<-matrix(.1,p,p) ; diag(L0)<-1 ; L0<-L0*outer(sd0,sd0)
nu0<-p+2 ; S0<-L0
###

### starting values
Sigma<-S0
Y.full<-Y
O<-1*(!is.na(Y))
for(j in 1:p)
{
  # fill each missing entry with the observed mean of its column
  Y.full[is.na(Y.full[,j]),j]<-mean(Y.full[,j],na.rm=TRUE)
}
###

### Gibbs sampler
THETA<-SIGMA<-Y.MISS<-NULL
set.seed(1)

for(s in 1:1000)
{

  ###update theta
  ybar<-apply(Y.full,2,mean)
  Ln<-solve( solve(L0) + n*solve(Sigma) )
  mun<-Ln%*%( solve(L0)%*%mu0 + n*solve(Sigma)%*%ybar )
  theta<-rmvnorm(1,mun,Ln)
  ###

  ###update Sigma
  Sn<- S0 + ( t(Y.full)-c(theta) )%*%t( t(Y.full)-c(theta) )
  Sigma<-solve( rwish(1, nu0+n, solve(Sn)) )
  ###

  ###update missing data (Equations 7.10 and 7.11)
  for(i in 1:n)
  {
    b <- ( O[i,]==0 )
    a <- ( O[i,]==1 )
    if(!any(b)) next    # nothing to impute for a complete row
    iSa<- solve(Sigma[a,a])
    beta.j <- Sigma[b,a]%*%iSa
    Sigma.j <- Sigma[b,b] - Sigma[b,a]%*%iSa%*%Sigma[a,b]
    theta.j<- theta[b] + beta.j%*%(t(Y.full[i,a])-theta[a])
    Y.full[i,b] <- rmvnorm(1,theta.j,Sigma.j )
  }

  ### save results
  THETA<-rbind(THETA,theta) ; SIGMA<-rbind(SIGMA,c(Sigma) )
  Y.MISS<-rbind(Y.MISS, Y.full[O==0] )
  ###
}
###


The prior mean of $\mu_0 = (120, 64, 26, 26)^T$ was obtained from national averages,
and the prior variances were based primarily on keeping most of the prior
mass on values that are above zero. These prior distributions are likely much
more diffuse than more informed prior distributions that could be provided
by someone who is familiar with this population or these variables.

The Monte Carlo approximation of $\mathrm{E}[\theta|y_1,\ldots,y_n]$ is (123.46, 71.03, 29.35,
32.18), obtained by averaging the 1,000 $\theta$-values generated by the Gibbs
sampler. Posterior confidence intervals and other quantities can additionally be
obtained in the usual way from the Gibbs samples. We can also average the
1,000 values of $\Sigma$ to obtain $\mathrm{E}[\Sigma|y_1,\ldots,y_n]$, the posterior expectation of $\Sigma$.
However, when looking at associations among a set of variables, it is often the
correlations that are of interest and not the covariances. To each covariance
matrix $\Sigma$ there corresponds a correlation matrix $C$, given by

$$C = \Big\{ c_{j,k} : c_{j,k} = \Sigma_{[j,k]} \big/ \sqrt{\Sigma_{[j,j]}\Sigma_{[k,k]}} \Big\}.$$


We can convert our 1,000 posterior samples of $\Sigma$ into 1,000 posterior samples
of $C$ using the following R-code:


COR <- array( dim=c(p,p,1000) )
for(s in 1:1000)
{
  Sig<-matrix( SIGMA[s,] ,nrow=p, ncol=p)
  COR[,,s] <- Sig/sqrt( outer( diag(Sig),diag(Sig) ) )
}

This code generates a 4 × 4 × 1000 array, where each “slice” is a 4 × 4 correlation
matrix generated from the posterior distribution. The posterior expectation
of $C$ is

$$
\mathrm{E}[C|y_1,\ldots,y_n] = \begin{pmatrix}
1.00 & 0.23 & 0.25 & 0.19 \\
0.23 & 1.00 & 0.25 & 0.24 \\
0.25 & 0.25 & 1.00 & 0.65 \\
0.19 & 0.24 & 0.65 & 1.00
\end{pmatrix}
$$

and marginal posterior 95% quantile-based confidence intervals can be
obtained with the command apply(COR, c(1,2), quantile, prob=c(.025,.975) ).
These are displayed graphically in the left panel of Figure 7.4.


Prediction and regression 


Fig. 7.4. Ninety-five percent posterior confidence intervals for correlations (left)
and regression coefficients (right).


Multivariate models are often used to predict one or more variables given the
others. Consider, for example, a predictive model of glu based on measurements
of bp, skin and bmi. Using $a = \{2, 3, 4\}$ and $b = \{1\}$ in Equation 7.10,
the conditional mean of $y_{[b]} =$ glu, given numerical values of $y_{[a]} = \{$bp, skin,
bmi$\}$, is given by

$$\mathrm{E}[y_{[b]}|\theta, \Sigma, y_{[a]}] = \theta_{[b]} + \beta_{b|a}^T(y_{[a]} - \theta_{[a]})$$

where $\beta_{b|a}^T = \Sigma_{[b,a]}(\Sigma_{[a,a]})^{-1}$. Since this takes the form of a linear regression
model, we call the value of $\beta_{b|a}$ the regression coefficient for $y_{[b]}$ given $y_{[a]}$
based on $\Sigma$. Values of $\beta_{b|a}$ can be computed for each posterior sample of
$\Sigma$, allowing us to obtain posterior expectations and confidence intervals for
these regression coefficients. Quantile-based 95% confidence intervals for each
of $\{\beta_{1|234}, \beta_{2|134}, \beta_{3|124}, \beta_{4|123}\}$ are shown graphically in the second column
of Figure 7.4. The regression coefficients often tell a different story than the
correlations: The bottom row of plots, for example, shows that while there
is strong evidence that the correlations between bmi and each of the other
variables are all positive, the plots on the right-hand side suggest that bmi is
nearly conditionally independent of glu and bp given skin.
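
Computing $\beta_{b|a}$ for every posterior draw of $\Sigma$ takes only a few lines. The sketch below assumes the draws are stored one per row as `c(Sigma)`, as in the sampler's `SIGMA` matrix; the helper name `reg_coef` is hypothetical, and the stand-in draws are generated with base R's `rWishart` purely for illustration.

```r
# beta_{b|a}' = Sigma[b,a] Sigma[a,a]^{-1} for one covariance matrix
reg_coef <- function(Sigma, b, a) {
  c(Sigma[b, a, drop = FALSE] %*% solve(Sigma[a, a, drop = FALSE]))
}

# Stand-in posterior draws of Sigma (illustrative; in practice use the rows
# of SIGMA saved by the Gibbs sampler). Uses stats::rWishart from base R.
set.seed(1)
p <- 4; S <- 1000
W <- rWishart(S, df = 50, Sigma = diag(p) + 0.5)      # p x p x S array
SIGMA <- t(apply(W, 3, function(w) c(solve(w))))      # S x p^2 matrix

# Coefficients for variable 1 given variables 2, 3, 4, one row per draw
BETA <- t(apply(SIGMA, 1, function(s) reg_coef(matrix(s, p, p), b = 1, a = 2:4)))
print(apply(BETA, 2, quantile, prob = c(.025, .5, .975)))
```

When `Sigma` is a sample covariance matrix, `reg_coef` returns the ordinary least squares slopes, which is one way to see why these quantities are called regression coefficients.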
Fig. 7.5. True values of the missing data versus their posterior expectations.


Out-of-sample validation 


Actually, the dataset we just analyzed was created by taking a complete data
matrix with no missing values and randomly replacing 10% of the entries with
NA’s. Since the original dataset is available, we can compare values predicted
by the model to the actual sample values. This comparison is made graphically
in Figure 7.5, which plots the true value of $y_{i,j}$ against its posterior mean for
each $\{i,j\}$ such that $o_{i,j} = 0$. It looks like we are able to do a better job
predicting missing values of skin and bmi than the other two variables. This
makes sense, as these two variables have the highest correlation. If skin is
missing, we can make a good prediction for it based on the observed value of
bmi, and vice-versa. Such a procedure, where we evaluate how well a model
does at predicting data that were not used to estimate the parameters, is
called out-of-sample validation, and is often used to quantify the predictive
performance of a model.
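
A comparison of this kind can be scripted once the imputed values have been saved. The sketch below assumes a matrix of imputed draws laid out like the sampler's `Y.MISS` (one row per iteration, columns in the column-major order of `Y.full[O==0]`); the function name `validate_imputations` and the synthetic data are made up for illustration.

```r
# Y.MISS: S x m matrix of imputed values, columns ordered as Y.full[O == 0]
# O:      n x p matrix of observation indicators (1 observed, 0 missing)
# Y.true: n x p matrix of the held-out complete data
validate_imputations <- function(Y.MISS, O, Y.true) {
  post.mean <- colMeans(Y.MISS)            # posterior mean of each missing y
  truth     <- Y.true[O == 0]              # same column-major ordering
  variable  <- col(O)[O == 0]              # which variable each entry belongs to
  # correlation between truth and prediction, one value per variable
  sapply(split(seq_along(truth), variable),
         function(k) cor(truth[k], post.mean[k]))
}

# Tiny synthetic illustration (all values are made up)
set.seed(1)
n <- 200; p <- 2
Y.true <- matrix(rnorm(n * p), n, p)
O      <- matrix(rbinom(n * p, 1, 0.9), n, p)
m      <- sum(O == 0)
# pretend predictions: less accurate for variable 1 than for variable 2
noise.sd <- c(2, 0.5)[col(O)[O == 0]]
pred     <- Y.true[O == 0] + rnorm(m, 0, noise.sd)
Y.MISS   <- t(replicate(500, pred + rnorm(m, 0, 0.1)))
res <- validate_imputations(Y.MISS, O, Y.true)
print(res)
```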


7.6 Discussion and further references 


The multivariate normal model can be justified as a sampling model for rea- 
sons analogous to those for the univariate normal model (see Section 5.7): It is 
characterized by independence between the sample mean and sample variance 
(Rao, 1958), it is a maximum entropy distribution and it provides consistent 
estimation of the population mean and variance, even if the population is not 
multivariate normal. 

The multivariate normal and Wishart distributions form the foundation 
of multivariate data analysis. A classic text on the subject is Mardia et al 
(1979), and one with more coverage of Bayesian approaches is Press (1982). 
An area of much current Bayesian research involving the multivariate nor- 
mal distribution is the study of graphical models (Lauritzen, 1996; Jordan, 
1998). A graphical model allows elements of the precision matrix to be ex- 
actly equal to zero, implying some variables are conditionally independent of 
each other. A generalization of the Wishart distribution, known as the hyper- 
inverse-Wishart distribution, has been developed for such models (Dawid and 
Lauritzen, 1993; Letac and Massam, 2007). 


8 


Group comparisons and hierarchical modeling 


In this chapter we discuss models for the comparison of means across groups. 
In the two-group case, we parameterize the two population means by their 
average and their difference. This type of parameterization is extended to the 
multigroup case, where the average group mean and the differences across 
group means are described by a normal sampling model. This model, to- 
gether with a normal sampling model for variability among units within a 
group, make up a hierarchical normal model that describes both within-group 
and between-group variability. We also discuss an extension to this normal 
hierarchical model which allows for across-group heterogeneity in variances in 
addition to heterogeneity in means. 


8.1 Comparing two groups 


The first panel of Figure 8.1 shows math scores from a sample of 10th grade 
students from two public U.S. high schools. Thirty-one students from school 
1 and 28 students from school 2 were randomly selected to participate in a 
math test. Both schools have a total enrollment of around 600 10th graders 
each, and both are in urban neighborhoods. 

Suppose we are interested in estimating $\theta_1$, the average score we would
obtain if all 10th graders in school 1 were tested, and possibly comparing it
to $\theta_2$, the corresponding average from school 2. The results from the sample
data are $\bar{y}_1 = 50.81$ and $\bar{y}_2 = 46.15$, suggesting that $\theta_1$ is larger than $\theta_2$.
However, if different students had been sampled from each of the two schools,
then perhaps $\bar{y}_2$ would have been larger than $\bar{y}_1$. To assess whether or not the
observed mean difference of $\bar{y}_1 - \bar{y}_2 = 4.66$ is large compared to the sampling
variability it is standard practice to compute the t-statistic, which is the ratio
of the observed difference to an estimate of its standard deviation:


P.D. Hoff, A First Course in Bayesian Statistical Methods,
Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_8,
© Springer Science+Business Media, LLC 2009


$$
t(y_1, y_2) = \frac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{1/n_1 + 1/n_2}} = \frac{50.81 - 46.15}{10.44\sqrt{1/31 + 1/28}} = 1.74,
$$

where $s_p^2 = [(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2]/(n_1 + n_2 - 2)$, the pooled estimate of
the population variance of the two groups. Is this value of 1.74 large? From
introductory statistics, we know that if the population of scores from the two
schools are both normally distributed with the same mean and variance, then
the sampling distribution of the t-statistic $t(Y_1, Y_2)$ is a t-distribution with
$n_1 + n_2 - 2 = 57$ degrees of freedom. The density of this distribution is plotted
in the second panel of Figure 8.1, along with the observed value of the
t-statistic. If the two populations indeed follow the same normal population,
then the pre-experimental probability of sampling a dataset that would generate
a value of $t(Y_1, Y_2)$ greater in absolute value than 1.74 is $p = 0.087$. You
may recall that this latter number is called the (two-sided) p-value. While
a small p-value is generally considered as indicating evidence that $\theta_1$ and $\theta_2$
are different, the p-value should not be confused with the probability that
$\theta_1 = \theta_2$. Although not completely justified by statistical theory for this
purpose, p-values are often used in parameter estimation and model selection.
For example, the following is a commonly taught data analysis procedure for
comparing the population means of two groups:


Model selection based on p-values:

If $p < 0.05$,
- reject the model that the two groups have the same distribution;
- conclude that $\theta_1 \neq \theta_2$;
- use the estimates $\hat\theta_1 = \bar{y}_1$, $\hat\theta_2 = \bar{y}_2$.

If $p > 0.05$,
- accept the model that the two groups have the same distribution;
- conclude that $\theta_1 = \theta_2$;
- use the estimates $\hat\theta_1 = \hat\theta_2 = (\sum y_{i,1} + \sum y_{i,2})/(n_1 + n_2)$.


Fig. 8.1. Boxplots of samples of 10th grade math scores from two schools, and the
null distribution for testing equality of the population means. The gray line indicates
the observed value of the t-statistic.
This data analysis procedure results in either treating the two populations 
as completely distinct or treating them as exactly identical. Do these rather 
extreme alternatives make sense? For our math score data, the above proce- 
dure would take the p-value of 0.087 and tell us to treat the population means 
of the two groups as being numerically equivalent, although there seems to 
be some evidence of a difference. Conversely, it is not too hard to imagine 
a scenario where the sample from school 1 might have included a few more 
high-performing students, the sample from school 2 a few more low-performing 
students, in which case we could have observed a p-value of 0.04 or 0.05. In 
this latter case we would have treated each population separately, using only
data from school 1 to estimate $\theta_1$ and similarly for school 2. This latter approach
seems somewhat inefficient: Since the two samples are both measuring
the same thing on similar populations of students, it might make sense to use
some of the information from one group to help estimate the mean in the
other.
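
The p-value-based procedure can be sketched in R from the summary statistics quoted above. This is only an illustration: the function name `pvalue.procedure` and its arguments are hypothetical, and the inputs are the rounded values printed in the text, so the output should agree with the reported $t = 1.74$ and $p = 0.087$ only approximately.

```r
# Two-sample t-statistic, two-sided p-value and the p-value-based
# selection rule, computed from summary statistics.
# pvalue.procedure is a hypothetical helper, not code from the book.
pvalue.procedure <- function(ybar1, ybar2, sp, n1, n2, alpha = 0.05) {
  t.stat <- (ybar1 - ybar2) / (sp * sqrt(1 / n1 + 1 / n2))
  df     <- n1 + n2 - 2
  p      <- 2 * pt(-abs(t.stat), df)               # two-sided p-value
  pooled <- (n1 * ybar1 + n2 * ybar2) / (n1 + n2)  # pooled sample mean
  if (p < alpha) {
    est <- c(theta1 = ybar1, theta2 = ybar2)       # treat groups as distinct
  } else {
    est <- c(theta1 = pooled, theta2 = pooled)     # treat groups as identical
  }
  list(t = t.stat, df = df, p = p, estimates = est)
}

res <- pvalue.procedure(ybar1 = 50.81, ybar2 = 46.15, sp = 10.44,
                        n1 = 31, n2 = 28)
print(res)
```

Because the p-value exceeds 0.05 here, both estimates returned are the pooled sample mean, which is the "exactly identical" branch of the procedure.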

The $p$-value-based procedure described above can be re-expressed as estimating
$\theta_1$ as $\hat\theta_1 = w\bar y_1 + (1 - w)\bar y_2$, where $w = 1$ if $p < 0.05$ and
$w = n_1/(n_1 + n_2)$ otherwise. Instead of using such an extreme procedure,
it might make more sense to allow $w$ to vary continuously and have a value
that depends on such things as the relative sample sizes $n_1$ and $n_2$, the sampling
variability $\sigma^2$ and our prior information about the similarities of the two
populations. An estimator similar to this is produced by a Bayesian analysis
that allows for information to be shared across the groups. Consider the
following sampling model for data from the two groups:


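$$
\begin{aligned}
Y_{i,1} &= \mu + \delta + \epsilon_{i,1} \\
Y_{i,2} &= \mu - \delta + \epsilon_{i,2} \\
\{\epsilon_{i,j}\} &\sim \text{i.i.d. normal}(0, \sigma^2).
\end{aligned}
$$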


Using this parameterization where $\theta_1 = \mu + \delta$ and $\theta_2 = \mu - \delta$, we see that $\delta$
represents half the population difference in means, as $(\theta_1 - \theta_2)/2 = \delta$, and $\mu$
represents the pooled average, as $(\theta_1 + \theta_2)/2 = \mu$. Convenient conjugate prior
distributions for the unknown parameters are

$$
\begin{aligned}
p(\mu, \delta, \sigma^2) &= p(\mu) \times p(\delta) \times p(\sigma^2) \\
\mu &\sim \text{normal}(\mu_0, \gamma_0^2) \\
\delta &\sim \text{normal}(\delta_0, \tau_0^2) \\
\sigma^2 &\sim \text{inverse-gamma}(\nu_0/2, \nu_0\sigma_0^2/2).
\end{aligned}
$$


It is left as an exercise to show that the full conditional distributions of these 
parameters are as follows: 


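$$
\begin{aligned}
\{\mu \mid \boldsymbol y_1, \boldsymbol y_2, \delta, \sigma^2\} &\sim \text{normal}(\mu_n, \gamma_n^2), \text{ where} \\
\mu_n &= \gamma_n^2 \times \Big[\mu_0/\gamma_0^2 + \sum_{i=1}^{n_1} (y_{i,1} - \delta)/\sigma^2 + \sum_{i=1}^{n_2} (y_{i,2} + \delta)/\sigma^2\Big] \\
\gamma_n^2 &= [1/\gamma_0^2 + (n_1 + n_2)/\sigma^2]^{-1}
\end{aligned}
$$

$$
\begin{aligned}
\{\delta \mid \boldsymbol y_1, \boldsymbol y_2, \mu, \sigma^2\} &\sim \text{normal}(\delta_n, \tau_n^2), \text{ where} \\
\delta_n &= \tau_n^2 \times \Big[\delta_0/\tau_0^2 + \sum_{i=1}^{n_1} (y_{i,1} - \mu)/\sigma^2 - \sum_{i=1}^{n_2} (y_{i,2} - \mu)/\sigma^2\Big] \\
\tau_n^2 &= [1/\tau_0^2 + (n_1 + n_2)/\sigma^2]^{-1}
\end{aligned}
$$

$$
\begin{aligned}
\{\sigma^2 \mid \boldsymbol y_1, \boldsymbol y_2, \mu, \delta\} &\sim \text{inverse-gamma}(\nu_n/2, \nu_n\sigma_n^2/2), \text{ where} \\
\nu_n &= \nu_0 + n_1 + n_2 \\
\nu_n\sigma_n^2 &= \nu_0\sigma_0^2 + \sum_{i=1}^{n_1} (y_{i,1} - [\mu + \delta])^2 + \sum_{i=1}^{n_2} (y_{i,2} - [\mu - \delta])^2
\end{aligned}
$$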


Although these formulae seem quite involved, you should try to convince yourself
that they make sense. One way to do this is to plug in extreme values for
the prior parameters. For example, if $\nu_0 = 0$ then

$$
\sigma_n^2 = \frac{\sum (y_{i,1} - [\mu + \delta])^2 + \sum (y_{i,2} - [\mu - \delta])^2}{n_1 + n_2},
$$

which is a pooled-sample estimate of the variance if the values of $\mu$ and $\delta$ were
known. Similarly, if $\mu_0 = \delta_0 = 0$ and $\gamma_0^2 = \tau_0^2 = \infty$, then (defining $0/\infty = 0$)

$$
\mu_n = \frac{\sum (y_{i,1} - \delta) + \sum (y_{i,2} + \delta)}{n_1 + n_2}, \qquad
\delta_n = \frac{\sum (y_{i,1} - \mu) - \sum (y_{i,2} - \mu)}{n_1 + n_2},
$$

and if you plug in $\mu_n$ for $\mu$ and $\delta_n$ for $\delta$, you get $\bar y_1 = \mu_n + \delta_n$ and $\bar y_2 = \mu_n - \delta_n$.

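
The last claim can be checked numerically. The sketch below uses simulated stand-in data (the group sizes 31 and 28 mirror the two schools, but the values themselves are made up) and confirms that the pooled average and half the difference in sample means form a fixed point of the two conditional-mean formulas under flat priors. The function names `mu.n` and `del.n` are illustrative.

```r
# Numerical check of the limiting case mu0 = del0 = 0 with infinite prior
# variances. The data are simulated stand-ins, not the math score data.
set.seed(1)
y1 <- rnorm(31, 50, 10)
y2 <- rnorm(28, 46, 10)
n1 <- length(y1); n2 <- length(y2)

# Conditional means of mu and delta in the flat-prior limit
mu.n  <- function(del) (sum(y1 - del) + sum(y2 + del)) / (n1 + n2)
del.n <- function(mu)  (sum(y1 - mu)  - sum(y2 - mu))  / (n1 + n2)

# Candidate solution: pooled average and half the difference in sample means
mu.hat  <- (mean(y1) + mean(y2)) / 2
del.hat <- (mean(y1) - mean(y2)) / 2

# Both differences are zero up to rounding error: (mu.hat, del.hat) is a
# fixed point, so mu.hat + del.hat = mean(y1) and mu.hat - del.hat = mean(y2)
c(mu.n(del.hat) - mu.hat, del.n(mu.hat) - del.hat)
```
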
Analysis of the math score data 


The math scores were based on results of a national exam in the United States,
standardized to produce a nationwide mean of 50 and a standard deviation
of 10. Unless these two schools were known in advance to be extremely exceptional,
reasonable prior parameters can be based on this information. For
the prior distributions of $\mu$ and $\sigma^2$, we'll take $\mu_0 = 50$ and $\sigma_0^2 = 10^2 = 100$,
although this latter value is likely to be an overestimate of the within-school
sampling variability. We'll make these prior distributions somewhat diffuse,
with $\gamma_0^2 = 25^2 = 625$ and $\nu_0 = 1$. For the prior distribution on $\delta$, choosing
$\delta_0 = 0$ represents the prior opinion that $\theta_1 > \theta_2$ and $\theta_2 > \theta_1$ are equally
probable. Finally, since the scores are bounded between 0 and 100, half the
difference between $\theta_1$ and $\theta_2$ must be less than 50 in absolute value, so a value
of $\tau_0^2 = 25^2 = 625$ seems reasonably diffuse.

Using the full conditional distributions given above, we can construct a
Gibbs sampler to approximate the posterior distribution $p(\mu, \delta, \sigma^2 \mid \boldsymbol y_1, \boldsymbol y_2)$.
R-code to do the approximation appears below:


data(chapter8)
y1<-y.school1 ; n1<-length(y1)
y2<-y.school2 ; n2<-length(y2)

##### prior parameters
mu0<-50 ; g02<-625
del0<-0 ; t02<-625
s20<-100 ; nu0<-1
#####

##### starting values
mu<-( mean(y1) + mean(y2) )/2
del<-( mean(y1) - mean(y2) )/2
#####

##### Gibbs sampler
MU<-DEL<-S2<-NULL
set.seed(1)
for(s in 1:5000)
{

  ##update s2 (inverse-gamma full conditional)
  s2<-1/rgamma(1,(nu0+n1+n2)/2,
        (nu0*s20+sum((y1-mu-del)^2)+sum((y2-mu+del)^2) )/2)
  ##

  ##update mu (normal full conditional)
  var.mu<-  1/(1/g02+ (n1+n2)/s2 )
  mean.mu<- var.mu*( mu0/g02 + sum(y1-del)/s2 + sum(y2+del)/s2 )
  mu<-rnorm(1,mean.mu,sqrt(var.mu))
  ##

  ##update del (normal full conditional)
  var.del<-  1/(1/t02+ (n1+n2)/s2 )
  mean.del<- var.del*( del0/t02 + sum(y1-mu)/s2 - sum(y2-mu)/s2 )
  del<-rnorm(1,mean.del,sqrt(var.del))
  ##

  ##save parameter values
  MU<-c(MU,mu) ; DEL<-c(DEL,del) ; S2<-c(S2,s2)
}
#####

Figure 8.2 shows the marginal posterior distributions of $\mu$ and $\delta$, and how they
are much more concentrated than their corresponding prior distributions. In
particular, a 95% quantile-based posterior confidence interval for $2 \times \delta$, the
difference in average scores between the two schools, is $(-0.61, 9.98)$. Although
this interval contains zero, the differences between the prior and posterior
distributions indicate that we have gathered substantial evidence that the
population mean for school 1 is higher than that of school 2. Additionally, the
posterior probability $\Pr(\theta_1 > \theta_2 \mid \boldsymbol y_1, \boldsymbol y_2) = \Pr(\delta > 0 \mid \boldsymbol y_1, \boldsymbol y_2) = 0.96$, whereas
the corresponding prior probability was $\Pr(\delta > 0) = 0.50$. However, we should
be careful not to confuse this probability with the probability that a randomly
selected student from school 1 has a higher score than one sampled from school
2. This latter probability can be obtained from the joint posterior predictive
distribution, which gives $\Pr(Y_1 > Y_2 \mid \boldsymbol y_1, \boldsymbol y_2) \approx 0.62$.
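
Summaries like these are computed directly from saved Gibbs output. In the sketch below, `MU`, `DEL` and `S2` are stand-in draws generated only so that the block runs on its own; their centers and spreads are assumptions chosen to loosely resemble the results reported above, not the actual posterior. With real sampler output, the three stand-in assignments would simply be dropped.

```r
set.seed(1)
S   <- 5000
MU  <- rnorm(S, 48.5, 1.4)           # stand-in draws of mu (assumed values)
DEL <- rnorm(S, 2.3, 1.4)            # stand-in draws of delta (assumed values)
S2  <- 1 / rgamma(S, 30, 30 * 110)   # stand-in draws of sigma^2 (assumed values)

# 95% quantile-based interval for theta1 - theta2 = 2 * delta
quantile(2 * DEL, c(0.025, 0.975))

# Pr(theta1 > theta2 | data) = Pr(delta > 0 | data)
pr.delta <- mean(DEL > 0)

# Posterior predictive draws: one new student per school for each saved state
y1.new  <- rnorm(S, MU + DEL, sqrt(S2))
y2.new  <- rnorm(S, MU - DEL, sqrt(S2))
pr.pred <- mean(y1.new > y2.new)

c(pr.delta = pr.delta, pr.pred = pr.pred)
```

The predictive probability is much closer to one half than the probability that $\delta > 0$, because it also includes the student-to-student variability $\sigma^2$.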
Fig. 8.2. Prior and posterior distributions for $\mu$ and $\delta$.


8.2 Comparing multiple groups 


The data in the previous section were part of the 2002 Educational Longitudinal
Study (ELS), a survey of students from a large sample of schools across
the United States. This dataset includes a population of schools as well as a
population of students within each school. Datasets like this, where there is
a hierarchy of nested populations, are often called hierarchical or multilevel.
Other situations having the same sort of data structure include data on

• patients within several hospitals,
• genes within a group of animals, or
• people within counties within regions within countries.

The simplest type of multilevel data has two levels, in which one level consists
of groups and the other consists of units within groups. In this case we denote
$Y_{i,j}$ as the data on the $i$th unit in group $j$.
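
One convenient way to hold two-level data in R is a two-column matrix with the group label in the first column and the measurement in the second, which is the layout the R-code in Section 8.4 indexes as `Y[,1]` and `Y[,2]`. The sketch below uses made-up numbers purely to illustrate the indexing.

```r
# Toy two-level dataset: column 1 = group j, column 2 = measurement y_{i,j}.
# The values are invented for illustration only.
Y <- cbind(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
           y     = c(52, 47, 55, 40, 44, 61, 58, 63, 60))

n    <- as.vector(tapply(Y[, 2], Y[, 1], length))  # group sample sizes n_j
ybar <- as.vector(tapply(Y[, 2], Y[, 1], mean))    # group sample means
y2   <- Y[Y[, 1] == 2, 2]                          # data from group 2

n
ybar
y2
```
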


8.2.1 Exchangeability and hierarchical models 


Recall from Chapter 2 that a sequence of random variables $Y_1,\ldots,Y_n$ is exchangeable
if the probability density representing our information about the
sequence satisfies $p(y_1,\ldots,y_n) = p(y_{\pi_1},\ldots,y_{\pi_n})$ for any permutation $\pi$. Exchangeability
is a reasonable property of $p(y_1,\ldots,y_n)$ if we lack information
distinguishing the random variables. For example, if $Y_1,\ldots,Y_n$ were math
scores from $n$ randomly selected students from a particular school, then in
the absence of other information about the students we might treat their
math scores as exchangeable. If exchangeability holds for all values of $n$, then
de Finetti's theorem says that an equivalent formulation of our information
is that

$$
\begin{aligned}
\phi &\sim p(\phi) \\
\{Y_1,\ldots,Y_n \mid \phi\} &\sim \text{i.i.d. } p(y \mid \phi).
\end{aligned}
$$

In other words, the random variables can be thought of as independent samples
from a population described by some fixed but unknown population feature
$\phi$. In the normal model, for example, we take $\phi = \{\theta, \sigma^2\}$ and model the data
as conditionally i.i.d. normal$(\theta, \sigma^2)$.
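
A small simulation may help separate "exchangeable" from "independent". In the sketch below, two observations are conditionally i.i.d. given an unknown mean $\theta$, yet marginally correlated because they share that mean. The numerical values (a mean of 50, variances $5^2$ and $10^2$) are illustrative assumptions.

```r
# Exchangeable but not independent: draw theta, then Y1 and Y2 i.i.d. given
# theta. All numerical settings are illustrative assumptions.
set.seed(1)
nsim  <- 10000
theta <- rnorm(nsim, 50, 5)       # population feature drawn from p(theta)
y1    <- rnorm(nsim, theta, 10)   # Y1 | theta
y2    <- rnorm(nsim, theta, 10)   # Y2 | theta, independent of Y1 given theta

r <- cor(y1, y2)                  # theory: 25 / (25 + 100) = 0.2
r
c(mean(y1), mean(y2))             # Y1 and Y2 share the same marginal distribution
```

The positive marginal correlation is what lets earlier observations inform predictions of later ones, which independence would rule out.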

Now let's consider a model describing our information about hierarchical
data $\{\boldsymbol Y_1,\ldots,\boldsymbol Y_m\}$, where $\boldsymbol Y_j = \{Y_{1,j},\ldots,Y_{n_j,j}\}$. What properties should a
model $p(\boldsymbol y_1,\ldots,\boldsymbol y_m)$ have? Let's consider first $p(\boldsymbol y_j) = p(y_{1,j},\ldots,y_{n_j,j})$, the
marginal probability density of data from a single group $j$. The discussion
in the preceding paragraph suggests that we should not treat $Y_{1,j},\ldots,Y_{n_j,j}$
as being independent, as doing so would imply, for example, that
$p(y_{n_j,j} \mid y_{1,j},\ldots,y_{n_j-1,j}) = p(y_{n_j,j})$, and that the values of $Y_{1,j},\ldots,Y_{n_j-1,j}$ would
give us no information about $Y_{n_j,j}$. However, if all that is known about
$Y_{1,j},\ldots,Y_{n_j,j}$ is that they are random samples from group $j$, then treating
$Y_{1,j},\ldots,Y_{n_j,j}$ as exchangeable makes sense. If group $j$ is large compared to the
sample size $n_j$, then de Finetti's theorem and results of Diaconis and Freedman
(1980) say that we can model the data within group $j$ as conditionally
i.i.d. given some group-specific parameter $\phi_j$:

$$
\{Y_{1,j},\ldots,Y_{n_j,j} \mid \phi_j\} \sim \text{i.i.d. } p(y \mid \phi_j).
$$


But how should we represent our information about $\phi_1,\ldots,\phi_m$? As before,
we might not want to treat these parameters as independent, because doing
so would imply that knowing the values of $\phi_1,\ldots,\phi_{m-1}$ does not change
our information about the value of $\phi_m$. However, if the groups themselves
are samples from some larger population of groups, then exchangeability of
the group-specific parameters might be appropriate. Applying de Finetti's
theorem a second time gives

$$
\{\phi_1,\ldots,\phi_m \mid \psi\} \sim \text{i.i.d. } p(\phi \mid \psi)
$$

for some sampling model $p(\phi \mid \psi)$ and an unknown parameter $\psi$. This double
application of de Finetti's theorem has led us to three probability distributions:

$$
\begin{aligned}
\{Y_{1,j},\ldots,Y_{n_j,j} \mid \phi_j\} &\sim \text{i.i.d. } p(y \mid \phi_j) && \text{(within-group sampling variability)} \\
\{\phi_1,\ldots,\phi_m \mid \psi\} &\sim \text{i.i.d. } p(\phi \mid \psi) && \text{(between-group sampling variability)} \\
\psi &\sim p(\psi) && \text{(prior distribution)}
\end{aligned}
$$


It is important to recognize that the distributions $p(y \mid \phi)$ and $p(\phi \mid \psi)$ both
represent sampling variability among populations of objects: $p(y \mid \phi)$ represents
variability among measurements within a group and $p(\phi \mid \psi)$ represents variability
across groups. In contrast, $p(\psi)$ represents information about a single
fixed but unknown quantity. For this reason, we refer to $p(y \mid \phi)$ and $p(\phi \mid \psi)$
as sampling distributions, which are conceptually distinct from $p(\psi)$, a
prior distribution. In particular, the data will be used to estimate the within-
and between-group sampling distributions $p(y \mid \phi)$ and $p(\phi \mid \psi)$, whereas the
prior distribution $p(\psi)$ is not estimated from the data.


8.3 The hierarchical normal model 


A popular model for describing the heterogeneity of means across several pop- 
ulations is the hierarchical normal model, in which the within- and between- 
group sampling models are both normal: 


$$
\begin{aligned}
\phi_j &= \{\theta_j, \sigma^2\}, \quad p(y \mid \phi_j) = \text{normal}(\theta_j, \sigma^2) && \text{(within-group model)} \qquad (8.1) \\
\psi &= \{\mu, \tau^2\}, \quad p(\theta_j \mid \psi) = \text{normal}(\mu, \tau^2) && \text{(between-group model)} \qquad (8.2)
\end{aligned}
$$

It might help to visualize this setup as in Figure 8.3. Note that $p(\phi \mid \psi)$ only
describes the heterogeneity across group means, and not any heterogeneity in
group-specific variances. In fact, the within-group sampling variability $\sigma^2$ is
assumed to be constant across groups. At the end of this chapter we will
eliminate this assumption by adding a component to the model that allows
for group-specific variances.
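
To get a feel for the two levels of variability, one can simulate from (8.1) and (8.2) with the parameters held fixed. In the sketch below the values of `mu`, `tau2`, `sigma2`, the number of groups `m` and the common group size `nj` are all illustrative assumptions. The simulation shows that group sample means vary more than the true group means do, by the amount $\sigma^2/n_j$.

```r
# Simulate from the hierarchical normal model with fixed parameters.
# mu, tau2, sigma2, m and nj are illustrative assumptions.
set.seed(1)
mu <- 50; tau2 <- 25; sigma2 <- 100
m  <- 2000; nj <- 10

theta <- rnorm(m, mu, sqrt(tau2))                 # between-group model (8.2)
Y <- matrix(rnorm(m * nj, rep(theta, each = nj),  # within-group model (8.1)
                  sqrt(sigma2)),
            nrow = nj)                            # column j holds group j
ybar <- colMeans(Y)

var(theta)   # close to tau2 = 25
var(ybar)    # close to tau2 + sigma2 / nj = 35
```
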

The fixed but unknown parameters in this model are $\mu$, $\tau^2$ and $\sigma^2$. For
convenience we will use standard semiconjugate normal and inverse-gamma
prior distributions for these parameters:

$$
\begin{aligned}
1/\sigma^2 &\sim \text{gamma}(\nu_0/2, \nu_0\sigma_0^2/2) \\
1/\tau^2 &\sim \text{gamma}(\eta_0/2, \eta_0\tau_0^2/2) \\
\mu &\sim \text{normal}(\mu_0, \gamma_0^2)
\end{aligned}
$$

Fig. 8.3. A graphical representation of the basic hierarchical normal model.


8.3.1 Posterior inference 


The unknown quantities in our system include the group-specific means
$\{\theta_1,\ldots,\theta_m\}$, the within-group sampling variability $\sigma^2$ and the mean and variance
$(\mu, \tau^2)$ of the population of group-specific means. Joint posterior inference
for these parameters can be made by constructing a Gibbs sampler which
approximates the posterior distribution $p(\theta_1,\ldots,\theta_m, \mu, \tau^2, \sigma^2 \mid \boldsymbol y_1,\ldots,\boldsymbol y_m)$.

The Gibbs sampler proceeds by iteratively sampling each parameter from 
its full conditional distribution. Deriving the full conditional distributions in 
this highly parameterized system may seem like a daunting task, but it turns 
out that all of the necessary technical details have been covered in Chapters 
5 and 6. All that is required of us at this point is that we recognize certain 
analogies between the current model and the univariate normal model. Useful 
for this will be the following factorization: 


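$$
\begin{aligned}
p(\theta_1,&\ldots,\theta_m, \mu, \tau^2, \sigma^2 \mid \boldsymbol y_1,\ldots,\boldsymbol y_m) \\
&\propto p(\mu, \tau^2, \sigma^2) \times p(\theta_1,\ldots,\theta_m \mid \mu, \tau^2, \sigma^2) \times p(\boldsymbol y_1,\ldots,\boldsymbol y_m \mid \theta_1,\ldots,\theta_m, \mu, \tau^2, \sigma^2) \\
&= p(\mu)\,p(\tau^2)\,p(\sigma^2) \left\{ \prod_{j=1}^{m} p(\theta_j \mid \mu, \tau^2) \right\} \left\{ \prod_{j=1}^{m} \prod_{i=1}^{n_j} p(y_{i,j} \mid \theta_j, \sigma^2) \right\}. \qquad (8.3)
\end{aligned}
$$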


The term in the second pair of brackets is the result of an important conditional
independence feature of our model. Conditionally on $\{\theta_1,\ldots,\theta_m, \mu, \tau^2, \sigma^2\}$,
the random variables $Y_{1,j},\ldots,Y_{n_j,j}$ are independent with a distribution
that depends only on $\theta_j$ and $\sigma^2$ and not on $\mu$ or $\tau^2$. It is helpful to think about
this fact in terms of the diagram in Figure 8.3: The existence of a path from
$(\mu, \tau^2)$ to each $\boldsymbol Y_j$ indicates that while $(\mu, \tau^2)$ provides information about $\boldsymbol Y_j$,
it only does so indirectly through $\theta_j$, which separates the two quantities in
the graph.


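Full conditional distributions of $\mu$ and $\tau^2$


As a function of $\mu$ and $\tau^2$, the term in Equation 8.3 is proportional to

$$
p(\mu)\,p(\tau^2) \prod_{j=1}^{m} p(\theta_j \mid \mu, \tau^2),
$$

and so the full conditional distributions of $\mu$ and $\tau^2$ are also proportional to
this quantity. In particular, this must mean that

$$
\begin{aligned}
p(\mu \mid \theta_1,\ldots,\theta_m, \tau^2, \sigma^2, \boldsymbol y_1,\ldots,\boldsymbol y_m) &\propto p(\mu) \prod_{j=1}^{m} p(\theta_j \mid \mu, \tau^2) \\
p(\tau^2 \mid \theta_1,\ldots,\theta_m, \mu, \sigma^2, \boldsymbol y_1,\ldots,\boldsymbol y_m) &\propto p(\tau^2) \prod_{j=1}^{m} p(\theta_j \mid \mu, \tau^2).
\end{aligned}
$$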

These distributions are exactly the full conditional distributions from the one-sample
normal problem in Chapter 6. In Chapter 6 we derived the full conditionals
of the population mean and variance of a normal population, assuming
independent normal and inverse-gamma prior distributions. In our current situation,
$\theta_1,\ldots,\theta_m$ are the i.i.d. samples from a normal population, and $\mu$ and
$\tau^2$ are the unknown population mean and variance. In Chapter 6, we saw that
if $y_1,\ldots,y_n$ were i.i.d. normal$(\theta, \sigma^2)$ and $\theta$ had a normal prior distribution,
then the conditional distribution of $\theta$ was also normal. Since our current situation
is exactly analogous, the fact that $\theta_1,\ldots,\theta_m$ are i.i.d. normal$(\mu, \tau^2)$ and
$\mu$ has a normal prior distribution implies that the conditional distribution of
$\mu$ must be normal as well. Similarly, just as $\sigma^2$ had an inverse-gamma conditional
distribution in Chapter 6, $\tau^2$ has an inverse-gamma distribution in
the current situation. Applying the results of Chapter 6 with the appropriate
symbolic replacements, we have

$$
\begin{aligned}
\{\mu \mid \theta_1,\ldots,\theta_m, \tau^2\} &\sim \text{normal}\left( \frac{m\bar\theta/\tau^2 + \mu_0/\gamma_0^2}{m/\tau^2 + 1/\gamma_0^2},\; [m/\tau^2 + 1/\gamma_0^2]^{-1} \right) \\
\{1/\tau^2 \mid \theta_1,\ldots,\theta_m, \mu\} &\sim \text{gamma}\left( \frac{\eta_0 + m}{2},\; \frac{\eta_0\tau_0^2 + \sum_{j=1}^{m} (\theta_j - \mu)^2}{2} \right).
\end{aligned}
$$


Full conditional of $\theta_j$


Collecting the terms in Equation 8.3 that depend on $\theta_j$ shows that the full
conditional distribution of $\theta_j$ must be proportional to

$$
p(\theta_j \mid \mu, \tau^2, \sigma^2, \boldsymbol y_1,\ldots,\boldsymbol y_m) \propto p(\theta_j \mid \mu, \tau^2) \prod_{i=1}^{n_j} p(y_{i,j} \mid \theta_j, \sigma^2).
$$

This says that, conditional on $\{\mu, \tau^2, \sigma^2, \boldsymbol y_j\}$, $\theta_j$ must be conditionally independent
of the other $\theta$'s as well as independent of the data from groups other
than $j$. Again, it is helpful to refer to Figure 8.3: While there is a path from
each $\theta_j$ to every other $\theta_k$, the paths go through $(\mu, \tau^2)$ or $\sigma^2$. We can think
of this as meaning that the $\theta$'s contribute no information about each other
beyond that contained in $\mu$, $\tau^2$ and $\sigma^2$.

The terms in the above equation include a normal density for $\theta_j$ multiplied
by a product of normal densities where $\theta_j$ is the mean. Mathematically, this is
exactly the same setup as the one-sample normal model, in which $p(\theta_j \mid \mu, \tau^2)$
is the prior distribution instead of the sampling model for the $\theta$'s. The full
conditional distribution is therefore

$$
\{\theta_j \mid y_{1,j},\ldots,y_{n_j,j}, \mu, \tau^2, \sigma^2\} \sim \text{normal}\left( \frac{n_j\bar y_j/\sigma^2 + \mu/\tau^2}{n_j/\sigma^2 + 1/\tau^2},\; [n_j/\sigma^2 + 1/\tau^2]^{-1} \right).
$$


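Full conditional of $\sigma^2$


Using Figure 8.3 and the arguments in the previous two paragraphs, you
should convince yourself that $\sigma^2$ is conditionally independent of $\{\mu, \tau^2\}$ given
$\{\boldsymbol y_1,\ldots,\boldsymbol y_m, \theta_1,\ldots,\theta_m\}$. The derivation of the full conditional of $\sigma^2$ is similar
to that in the one-sample normal model, except now we have information
about $\sigma^2$ from $m$ separate groups:

$$
\begin{aligned}
p(\sigma^2 \mid \theta_1,\ldots,\theta_m, \boldsymbol y_1,\ldots,\boldsymbol y_m) &\propto p(\sigma^2) \prod_{j=1}^{m} \prod_{i=1}^{n_j} p(y_{i,j} \mid \theta_j, \sigma^2) \\
&\propto (\sigma^2)^{-(\nu_0/2 + 1)} e^{-\nu_0\sigma_0^2/(2\sigma^2)} \times (\sigma^2)^{-\sum_j n_j/2} e^{-\sum_j \sum_i (y_{i,j} - \theta_j)^2/(2\sigma^2)}.
\end{aligned}
$$

Adding the powers of $\sigma^2$ and collecting the terms in the exponent, we recognize
this as proportional to an inverse-gamma density, giving

$$
\{1/\sigma^2 \mid \boldsymbol\theta, \boldsymbol y_1,\ldots,\boldsymbol y_m\} \sim \text{gamma}\left( \frac{1}{2}\Big[\nu_0 + \sum_{j=1}^{m} n_j\Big],\; \frac{1}{2}\Big[\nu_0\sigma_0^2 + \sum_{j=1}^{m} \sum_{i=1}^{n_j} (y_{i,j} - \theta_j)^2\Big] \right).
$$

Note that $\sum_j \sum_i (y_{i,j} - \theta_j)^2$ is the sum of squared residuals across all groups,
conditional on the within-group means, and so the conditional distribution
concentrates probability around a pooled-sample estimate of the variance.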


8.4 Example: Math scores in U.S. public schools 


Let’s return to the 2002 ELS data described previously. This survey included 
10th grade children from 100 different large urban public high schools, all 
having a 10th grade enrollment of 400 or greater. Data from these schools are 
shown in Figure 8.4, with scores from students within the same school plotted 
along a common vertical bar. 

A histogram of the sample averages is shown in the first panel of Figure 8.5. 
The range of average scores is quite large, with the lowest average being 36.6 
and the highest 65.0. The second panel of the figure shows the relationship 
between the sample average and the sample size. This plot seems to indicate 
that very extreme sample averages tend to be associated with schools with 
small sample sizes. For example, the school with the highest sample average 
has the lowest sample size, and many schools with low sample averages also 
have low sample sizes. This relationship between sample averages and sample 
size is fairly common in hierarchical datasets. To understand this phenomenon, 
consider a situation in which all $\theta_j$'s were equal to some common value, say
$\theta_0$, but the sample sizes were different. The expected value of each sample
average would then be $\mathrm{E}[\bar Y_j \mid \theta_j, \sigma^2] = \theta_j = \theta_0$, but the variances would depend
on the sample size, since $\mathrm{Var}[\bar Y_j \mid \sigma^2] = \sigma^2/n_j$. As a result, sample averages for
groups with large sample sizes would be very close to $\theta_0$, whereas the sample
averages for groups with small sample sizes would be farther away, both less
than and greater than $\theta_0$. For this reason, it is not uncommon that groups
with very high or very low sample averages are also those groups with low
sample sizes.

Fig. 8.5. Empirical distribution of sample means, and the relationship between
sample mean and sample size.
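
This thought experiment is easy to simulate. In the sketch below every group has the same true mean, and only the sample sizes differ; `theta0`, `sigma`, the range of sample sizes and the number of groups (set larger than the 100 schools in the data so that the comparison is stable) are illustrative assumptions.

```r
# All groups share the same mean theta0; only the sample sizes differ.
# theta0, sigma, m and the range of n are illustrative assumptions.
set.seed(1)
theta0 <- 50; sigma <- 10
m <- 1000
n <- sample(5:30, m, replace = TRUE)                  # group sample sizes
ybar <- sapply(n, function(nj) mean(rnorm(nj, theta0, sigma)))

small    <- n <= median(n)
sd.small <- sd(ybar[small])     # spread of sample means, small groups
sd.large <- sd(ybar[!small])    # spread of sample means, large groups
c(sd.small = sd.small, sd.large = sd.large)

# plot(n, ybar) would show the funnel shape described in the text
```

Even though no group is truly exceptional, the sample means of the small groups are noticeably more spread out than those of the large groups.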
8.4.1 Prior distributions and posterior approximation


The prior parameters we need to specify are

$(\nu_0, \sigma_0^2)$ for $p(\sigma^2)$,
$(\eta_0, \tau_0^2)$ for $p(\tau^2)$ and
$(\mu_0, \gamma_0^2)$ for $p(\mu)$.

As described above, the math exam was designed to give a nationwide variance
of 100. Since this variance includes both within-school and between-school
variance, the within-school variance should be at most 100, which we take as
$\sigma_0^2$. This is likely to be an overestimate, and so we only weakly concentrate the
prior distribution around this value by taking $\nu_0 = 1$. Similarly, the between-school
variance should not be more than 100, and so we take $\tau_0^2 = 100$ and
$\eta_0 = 1$. Finally, the nationwide mean over all schools is 50. Although the mean
for large urban public schools may be different than the nationwide average,
it should not differ by too much. We take $\mu_0 = 50$ and $\gamma_0^2 = 25$, so that the
prior probability that $\mu$ is in the interval (40, 60) is about 95%.
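
The stated prior probability can be checked in one line, since (40, 60) is the prior mean plus or minus two prior standard deviations. The variable name `pr` is just for illustration.

```r
# Prior probability that mu lies in (40, 60) under mu ~ normal(50, 25)
mu0 <- 50; g20 <- 25
pr <- pnorm(60, mu0, sqrt(g20)) - pnorm(40, mu0, sqrt(g20))
pr   # about 0.954
```
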

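Posterior approximation proceeds by iterative sampling of each unknown
quantity from its full conditional distribution. Given a current state of the
unknowns $\{\theta_1^{(s)},\ldots,\theta_m^{(s)}, \mu^{(s)}, \tau^{2(s)}, \sigma^{2(s)}\}$, a new state is generated as follows:

1. sample $\mu^{(s+1)} \sim p(\mu \mid \theta_1^{(s)},\ldots,\theta_m^{(s)}, \tau^{2(s)})$;
2. sample $\tau^{2(s+1)} \sim p(\tau^2 \mid \theta_1^{(s)},\ldots,\theta_m^{(s)}, \mu^{(s+1)})$;
3. sample $\sigma^{2(s+1)} \sim p(\sigma^2 \mid \theta_1^{(s)},\ldots,\theta_m^{(s)}, \boldsymbol y_1,\ldots,\boldsymbol y_m)$;
4. for each $j \in \{1,\ldots,m\}$, sample $\theta_j^{(s+1)} \sim p(\theta_j \mid \mu^{(s+1)}, \tau^{2(s+1)}, \sigma^{2(s+1)}, \boldsymbol y_j)$.
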
The order in which the new parameters are generated does not matter. What 
does matter is that each parameter is updated conditional upon the most 
current value of the other parameters. This Gibbs sampling procedure can be 
implemented in R with the code below: 


data(chapter8) ; Y<-Y.school.mathscore

### weakly informative priors
nu0<-1  ; s20<-100
eta0<-1 ; t20<-100
mu0<-50 ; g20<-25
###

### starting values
m<-length(unique(Y[,1]))
n<-sv<-ybar<-rep(NA,m)
for(j in 1:m)
{
  ybar[j]<-mean(Y[Y[,1]==j,2])
  sv[j]<-var(Y[Y[,1]==j,2])
  n[j]<-sum(Y[,1]==j)
}
theta<-ybar ; sigma2<-mean(sv)
mu<-mean(theta) ; tau2<-var(theta)
###

### setup MCMC
set.seed(1)
S<-5000
THETA<-matrix( nrow=S,ncol=m)
SMT<-matrix( nrow=S,ncol=3)
###

### MCMC algorithm
for(s in 1:S)
{

  # sample new values of the thetas
  for(j in 1:m)
  {
    vtheta<-1/(n[j]/sigma2+1/tau2)
    etheta<-vtheta*(ybar[j]*n[j]/sigma2+mu/tau2)
    theta[j]<-rnorm(1,etheta,sqrt(vtheta))
  }

  # sample new value of sigma2
  nun<-nu0+sum(n)
  ss<-nu0*s20
  for(j in 1:m){ss<-ss+sum((Y[Y[,1]==j,2]-theta[j])^2)}
  sigma2<-1/rgamma(1,nun/2,ss/2)

  # sample a new value of mu
  vmu<- 1/(m/tau2+1/g20)
  emu<- vmu*(m*mean(theta)/tau2 + mu0/g20)
  mu<-rnorm(1,emu,sqrt(vmu))

  # sample a new value of tau2
  etam<-eta0+m
  ss<- eta0*t20 + sum( (theta-mu)^2 )
  tau2<-1/rgamma(1,etam/2,ss/2)

  # store results
  THETA[s,]<-theta
  SMT[s,]<-c(sigma2,mu,tau2)

}
###


Running this algorithm produces an $S \times m$ matrix THETA, containing a value
of the within-school mean for each school at each iteration of the Markov
chain. Additionally, the $S \times 3$ matrix SMT stores values of $\sigma^2$, $\mu$ and $\tau^2$, representing
approximate, correlated draws from the posterior distribution of these
parameters.


MCMC diagnostics 


Before we make inference using these MCMC samples we should determine if 
there might be any problems with the Gibbs sampler. The first thing we want 
to do is to see if there are any indications that the chain is not stationary, 
i.e. if the simulated parameter values are moving in a consistent direction. 
One way to do this is with traceplots, or plots of the parameter values versus 
iteration number. However, when the number of samples is large, such plots 
can be difficult to read because of the high density of the plotted points (see, 
for example, Figure 6.6). Standard practice is to plot only a subsequence of 
MCMC samples, such as every 100th sample. Another approach is to produce
boxplots of sequential groups of samples, as is done in Figure 8.6. The
first boxplot in the first plot, for example, represents the empirical distribution
of $\{\mu^{(1)},\ldots,\mu^{(500)}\}$, the second boxplot represents the distribution of
$\{\mu^{(501)},\ldots,\mu^{(1000)}\}$, and so on. Each of the 10 boxplots represents 1/10th of
the MCMC samples. If stationarity has been achieved, then the distribution
of samples in any one boxplot should be the same as that in any other. If 
we were to see that the medians or interquartile ranges of the boxplots were 
moving in a consistent direction with iteration number, then we would suspect 
that stationarity had not been achieved and we would have to run the chain 
longer. 


Fig. 8.6. Stationarity plots of the MCMC samples of $\mu$, $\sigma^2$ and $\tau^2$.


There does not seem to be any evidence that the chain has not achieved
stationarity, so we move on to see how quickly the Gibbs sampler is moving
around the parameter space. Lag-1 autocorrelations for the sequences of $\mu$,
$\sigma^2$ and $\tau^2$ are 0.15, 0.053 and 0.312, and the effective sample sizes are 3706,
4499 and 2503, respectively. Approximate Monte Carlo standard errors can be
obtained by dividing the approximated posterior standard deviations by the
square root of the effective sample sizes, giving values of (0.009, 0.04, 0.09) for
$\mu$, $\sigma^2$ and $\tau^2$ respectively. These are quite small compared to the scale of the
approximated posterior expectations of these parameters, (48.12, 84.85, 24.84).
Diagnostics should also be performed for the $\theta$-parameters: The effective sample
sizes for the 100 sequences of $\theta$-values ranged between 3,627 and 5,927,
with Monte Carlo standard errors ranging between 0.02 and 0.05.
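
The diagnostics above can be sketched with base R alone. The chain `x` below is a stand-in (an AR(1) sequence with lag-1 correlation 0.3, an assumption) so that the block runs without the sampler output; with real output one would use a column of SMT or THETA instead. The effective sample size here is a rough estimate that truncates the autocorrelation sum at the first non-positive lag, which is a simplification and may differ from what dedicated MCMC packages report.

```r
# Stand-in chain: AR(1) with lag-1 correlation 0.3 (an assumption)
set.seed(1)
S <- 5000
x <- as.numeric(arima.sim(list(ar = 0.3), n = S))

# Sample autocorrelations at lags 1..50
rho  <- acf(x, lag.max = 50, plot = FALSE)$acf[-1]
lag1 <- rho[1]

# Rough effective sample size: S / (1 + 2 * sum of leading positive lags)
k <- which(rho <= 0)[1] - 1               # lags before first non-positive one
k <- if (is.na(k)) length(rho) else k
ess <- S / (1 + 2 * sum(rho[seq_len(k)]))

# Monte Carlo standard error of the estimated posterior mean
mcse <- sd(x) / sqrt(ess)

# Stationarity check: medians of 10 sequential blocks should look alike
block <- cut(seq_len(S), 10, labels = FALSE)
block.medians <- tapply(x, block, median)

c(lag1 = lag1, ess = ess, mcse = mcse)
block.medians
```

Replacing `tapply(x, block, median)` with `boxplot(x ~ block)` gives a plot in the style of Figure 8.6.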


8.4.2 Posterior summaries and shrinkage 


Figure 8.7 shows Monte Carlo approximations to the posterior densities of
$\{\mu, \sigma^2, \tau^2\}$. The posterior means of $\mu$, $\sigma$ and $\tau$ are 48.12, 9.21 and 4.97 respectively,
indicating that roughly 95% of the scores within a classroom are within
$4 \times 9.21 \approx 37$ points of each other, whereas 95% of the average classroom scores
are within $4 \times 4.97 \approx 20$ points of each other.

Fig. 8.7. Marginal posterior distributions, with 2.5%, 50% and 97.5% quantiles
given by vertical lines.

One of the motivations behind hierarchical modeling is that information
can be shared across groups. Recall that, conditional on $\mu$, $\tau^2$, $\sigma^2$ and the
data, the expected value of $\theta_j$ is a weighted average of $\bar y_j$ and $\mu$:

$$
\mathrm{E}[\theta_j \mid \boldsymbol y_j, \mu, \tau^2, \sigma^2] = \frac{\bar y_j n_j/\sigma^2 + \mu/\tau^2}{n_j/\sigma^2 + 1/\tau^2}.
$$

As a result, the expected value of $\theta_j$ is pulled a bit from $\bar y_j$ towards $\mu$ by
an amount depending on $n_j$. This effect is called shrinkage. The first panel
of Figure 8.8 plots $\bar y_j$ versus $\hat\theta_j = \mathrm{E}[\theta_j \mid \boldsymbol y_1,\ldots,\boldsymbol y_m]$ for each group. Notice
that the relationship roughly follows a line with a slope that is less than
one, indicating that high values of $\bar y_j$ correspond to slightly less high values
of $\hat\theta_j$, and low values of $\bar y_j$ correspond to slightly less low values of $\hat\theta_j$. The
second panel of the plot shows the amount of shrinkage as a function of the
group-specific sample size. Groups with low sample sizes get shrunk the most,
whereas groups with large sample sizes hardly get shrunk at all. This makes
sense: The larger the sample size for a group, the more information we have
for that group and the less information we need to “borrow” from the rest of
the population.

Fig. 8.8. Shrinkage as a function of sample size.


Suppose our task is to rank the schools according to what we think their
performances would be if every student in each school took the math exam.
In this case, it makes sense to rank the schools according to the school-specific
posterior expectations $\{\mathrm{E}[\theta_1 \mid \boldsymbol y_1,\ldots,\boldsymbol y_m],\ldots,\mathrm{E}[\theta_m \mid \boldsymbol y_1,\ldots,\boldsymbol y_m]\}$. Alternatively,
we could ignore the results of the hierarchical model and just use
the school-specific sample averages $\{\bar y_1,\ldots,\bar y_m\}$. The two methods will give
similar but not exactly the same rankings. Consider the posterior distributions
of $\theta_{46}$ and $\theta_{82}$, as shown in Figure 8.9. Both of these schools have exceptionally
low sample means, in the bottom 10% of all schools. The first thing to note
is that the posterior density for school 46 is more peaked than that of school
82. This is because the sample size for school 46 is 21 students, whereas that
of school 82 is only 5 students. Therefore, our degree of certainty for $\theta_{46}$ is
much higher than that for $\theta_{82}$.

The raw data for the two schools are shown in dotplots below the posterior
densities, with the large dots representing the sample means $\bar y_{46}$ and $\bar y_{82}$. Note
that while the posterior expectation for school 82 is higher than that of 46
(42.53 compared to 41.31), the sample mean for school 82 is lower than that of
46 (38.76 compared to 40.18). Does this make sense? Suppose on the day of the
exam the student who got the lowest exam score in school 82 did not come to
class. Then the sample mean for school 82 would have been 41.99 as opposed to
38.76, a change of more than three points. In contrast, if the lowest performing
student in school 46 had not shown up, $\bar y_{46}$ would have been 40.9 as opposed
to 40.18, a change of only three quarters of a point. In other words, the low
value of the sample mean for school 82 can be explained by either $\theta_{82}$ being
very low, or just the possibility that a few of the five sampled students were
among the poorer performing students in the school. In contrast, for school 46
this latter possibility cannot explain the low value of the sample mean: Even
if a few of the sampled students were unrepresentative of their school-specific
average, it would not affect the sample mean as much because of the larger
sample size. For this reason, it makes sense to shrink the expectation of school
82 towards the population expectation $\mathrm{E}[\mu \mid \boldsymbol y_1,\ldots,\boldsymbol y_m] = 48.11$ by a greater
amount than for the expectation of school 46.
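
The reversal can be reproduced approximately by plugging the reported posterior means of $\mu$, $\sigma^2$ and $\tau^2$ (48.12, 84.85 and 24.84) into the conditional expectation of $\theta_j$. This plug-in calculation is only a rough stand-in for the full posterior expectation, and `cond.mean` is a hypothetical helper name.

```r
# Plug-in conditional expectation of theta_j: a weighted average of the
# sample mean and mu. cond.mean is a hypothetical helper, not book code.
cond.mean <- function(ybar, n, mu, sigma2, tau2) {
  w <- (n / sigma2) / (n / sigma2 + 1 / tau2)   # weight on the sample mean
  c(weight = w, estimate = w * ybar + (1 - w) * mu)
}

mu <- 48.12; sigma2 <- 84.85; tau2 <- 24.84
s46 <- cond.mean(ybar = 40.18, n = 21, mu, sigma2, tau2)
s82 <- cond.mean(ybar = 38.76, n = 5,  mu, sigma2, tau2)
rbind(school46 = s46, school82 = s82)
```

School 46 keeps a weight of about 0.86 on its own sample mean and school 82 only about 0.59, giving plug-in estimates of roughly 41.3 and 42.6, close to the posterior expectations 41.31 and 42.53 quoted above.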
Fig. 8.9. Data and posterior distributions for two schools.


To some people this reversal of rankings may seem strange or “unfair”:
The performance by the sampled students in school 46 was better on average
than those sampled in school 82, so why should they be ranked lower? While
“fairness” may be debated, the hierarchical model reflects the objective fact
that there is more evidence that $\theta_{46}$ is exceptionally low than there is evidence
that $\theta_{82}$ is exceptionally low. There are many other real-life situations where
differing amounts of evidence result in a switch of a ranking. For example,
on any given basketball team there are “bench” players who play very few
minutes during any given game. As such, many bench players have taken
only a few free throws in their entire career, and many have an observed free
throw shooting percentage of 100%. Under certain circumstances during a
basketball game (a “technical foul”) the coach has the opportunity to choose
from among any of his or her players to take a free throw and hopefully score
a point. In practice, coaches always choose an experienced veteran player with
a percentage of around 87% over a bench player who has made, for example,
three of three free throws in his career. While it may seem “unfair,” it is the
right decision to make: The coaches recognize that it is very unlikely that the
bench player's true free throw shooting percentage is anywhere near 100%.


8.5 Hierarchical modeling of means and variances 


If the population means vary across groups, shouldn’t we allow for the 
possibility that the population variances also vary across groups? Letting 
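$\sigma_j^2$ be the variance for group $j$, our sampling model would then become
$Y_{1,j}, \ldots, Y_{n_j,j} \sim$ i.i.d. normal$(\theta_j, \sigma_j^2)$, and our full conditional distribution for
each $\theta_j$ would be

$$\{\theta_j \mid y_{1,j}, \ldots, y_{n_j,j}, \sigma_j^2\} \sim \text{normal}\left( \frac{n_j \bar y_j/\sigma_j^2 + \mu/\tau^2}{n_j/\sigma_j^2 + 1/\tau^2}, \; \left[ n_j/\sigma_j^2 + 1/\tau^2 \right]^{-1} \right).$$
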

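How does $\sigma_j^2$ get estimated? If we were to specify that

$$1/\sigma_1^2, \ldots, 1/\sigma_m^2 \sim \text{i.i.d. gamma}(\nu_0/2, \nu_0\sigma_0^2/2), \qquad (8.4)$$

then as is shown in Chapter 6 the full conditional distribution of $1/\sigma_j^2$ is

$$\{1/\sigma_j^2 \mid y_{1,j}, \ldots, y_{n_j,j}, \theta_j\} \sim \text{gamma}\left( [\nu_0 + n_j]/2, \; \left[\nu_0\sigma_0^2 + \sum_{i} (y_{i,j} - \theta_j)^2\right]/2 \right),$$

and estimation for $\sigma_1^2, \ldots, \sigma_m^2$ can proceed by iteratively sampling their values
along with the other parameters in a Gibbs sampler.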
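
As a rough sketch of how these two full conditionals might be coded, the R function below performs one update of all group means and variances. The function name, its arguments and the list layout of the data are assumptions made for this illustration, and the simulated groups are made up.

```r
# One Gibbs update of the group means and variances, given mu, tau2,
# nu0 and s20. Y is assumed to be a list with one numeric vector of
# observations per group (a hypothetical data layout).
update.theta.sigma2 <- function(Y, theta, sigma2, mu, tau2, nu0, s20) {
  m <- length(Y)
  for (j in 1:m) {
    nj   <- length(Y[[j]])
    ybar <- mean(Y[[j]])
    # full conditional of theta_j: normal with precision nj/sigma2_j + 1/tau2
    v.theta  <- 1 / (nj / sigma2[j] + 1 / tau2)
    e.theta  <- v.theta * (nj * ybar / sigma2[j] + mu / tau2)
    theta[j] <- rnorm(1, e.theta, sqrt(v.theta))
    # full conditional of 1/sigma2_j: gamma
    ss        <- sum((Y[[j]] - theta[j])^2)
    sigma2[j] <- 1 / rgamma(1, (nu0 + nj) / 2, (nu0 * s20 + ss) / 2)
  }
  list(theta = theta, sigma2 = sigma2)
}

# Usage on three simulated groups (made-up data and parameter values)
set.seed(1)
Y   <- list(rnorm(5, 40, 10), rnorm(30, 55, 5), rnorm(12, 50, 8))
out <- update.theta.sigma2(Y, theta = sapply(Y, mean), sigma2 = sapply(Y, var),
                           mu = 50, tau2 = 25, nu0 = 2, s20 = 50)
```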

If $\nu_0$ and $\sigma_0^2$ are fixed in advance at some particular values, then the
distribution in (8.4) represents a prior distribution on variances such that, for
example, $p(\sigma_m^2 \mid \sigma_1^2, \ldots, \sigma_{m-1}^2) = p(\sigma_m^2)$, and so the information we may have
about $\sigma_1^2, \ldots, \sigma_{m-1}^2$ is not used to help us estimate $\sigma_m^2$. This seems inefficient:
If the sample size in group $m$ were small and we saw that $\sigma_1^2, \ldots, \sigma_{m-1}^2$ were
tightly concentrated around a particular value, then we would want to use this
fact to improve our estimation of $\sigma_m^2$. In other words, we want to be able to
learn about the sampling distribution of the $\sigma_j^2$'s and use this information to
improve our estimation for groups that may have low sample sizes. This can be
done by treating $\nu_0$ and $\sigma_0^2$ as parameters to be estimated, in which case (8.4)
is properly thought of as a sampling model for across-group heterogeneity in
population variances, and not as a prior distribution. Putting this together
with our model for heterogeneity in population means gives a hierarchical
model for both means and variances, which is depicted graphically in Figure
8.10.

Fig. 8.10. A graphical representation of the hierarchical normal model with
heterogeneous means and variances.

The unknown parameters to be estimated include $\{(\theta_1, \sigma_1^2), \ldots, (\theta_m, \sigma_m^2)\}$
representing the within-group sampling distributions, $\{\mu, \tau^2\}$ representing
across-group heterogeneity in means and $\{\nu_0, \sigma_0^2\}$ representing across-group
heterogeneity in variances. As before, the joint posterior distribution for all of
these parameters can be approximated by iteratively sampling each parameter
from its full conditional distribution given the others. The full conditional
distributions for $\mu$ and $\tau^2$ are unchanged from the previous section and the
full conditional distributions of $\theta_j$ and $\sigma_j^2$ are given above. What remains to
do is to specify the prior distributions for $\nu_0$ and $\sigma_0^2$ and obtain their full
conditional distributions. A conjugate class of prior densities for $\sigma_0^2$ is the
class of gamma densities: If $\sigma_0^2 \sim$ gamma$(a, b)$, then it is straightforward to show
that

$$p(\sigma_0^2 \mid \sigma_1^2, \ldots, \sigma_m^2, \nu_0) = \text{dgamma}\left( a + \tfrac{1}{2} m \nu_0, \; b + \tfrac{1}{2} \nu_0 \sum_{j=1}^m (1/\sigma_j^2) \right).$$

Notice that for small values of $a$ and $b$ the conditional mean of $\sigma_0^2$ is approximately
the harmonic mean of $\sigma_1^2, \ldots, \sigma_m^2$.
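
A minimal R sketch of this update, together with a numerical check of the harmonic-mean remark, might look as follows. The function name `sample.s20` and all numerical values are hypothetical choices for illustration.

```r
# Draw sigma_0^2 from its gamma full conditional, given the current group
# variances sigma2 (a vector of length m) and nu0; a and b are the
# parameters of the gamma(a, b) prior on sigma_0^2.
sample.s20 <- function(sigma2, nu0, a, b) {
  m <- length(sigma2)
  rgamma(1, a + m * nu0 / 2, b + nu0 * sum(1 / sigma2) / 2)
}

# With a and b near zero, the conditional mean (shape / rate) is close
# to the harmonic mean of the group variances.
sigma2 <- c(60, 85, 110, 75)       # made-up current values
nu0 <- 10 ; a <- 0.001 ; b <- 0.001
cond.mean <- (a + length(sigma2) * nu0 / 2) / (b + nu0 * sum(1 / sigma2) / 2)
harm.mean <- length(sigma2) / sum(1 / sigma2)

set.seed(1)
s20.draw <- sample.s20(sigma2, nu0, a, b)
```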
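A simple conjugate prior for $\nu_0$ does not exist, but if we restrict $\nu_0$ to be
a whole number, then it is easy to sample from its full conditional distribution.
For example, if we let the prior on $\nu_0$ be the geometric distribution on
$\{1, 2, \ldots\}$ so that $p(\nu_0) \propto e^{-\alpha\nu_0}$, then the full conditional distribution of $\nu_0$ is
proportional to

$$p(\nu_0 \mid \sigma_0^2, \sigma_1^2, \ldots, \sigma_m^2) \propto p(\nu_0) \times p(\sigma_1^2, \ldots, \sigma_m^2 \mid \nu_0, \sigma_0^2)$$

$$\propto \left( \frac{(\nu_0\sigma_0^2/2)^{\nu_0/2}}{\Gamma(\nu_0/2)} \right)^m \left( \prod_{j=1}^m \frac{1}{\sigma_j^2} \right)^{\nu_0/2 - 1} \times \exp\left\{ -\nu_0 \left( \alpha + \tfrac{1}{2} \sigma_0^2 \sum_{j=1}^m (1/\sigma_j^2) \right) \right\}.$$
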


While not pretty, this unnormalized probability distribution can be computed
for a large range of $\nu_0$-values and then sampled from. For example, the R-code
to sample from this distribution is as follows:

# NUMAX and alpha must be specified;
# m is the number of schools and s20 is the current value of sigma_0^2;
# sigma2.schools is the vector of current values
# of the school-specific population variances.
x<-1:NUMAX

# log of the unnormalized full conditional of nu0, evaluated at x=1,...,NUMAX
lpnu0<- m*( .5*x*log(s20*x/2)-lgamma(x/2) ) +
        (x/2-1)*sum(log(1/sigma2.schools)) +
        -x*(alpha + .5*s20*sum(1/sigma2.schools) )

# subtract the maximum before exponentiating to avoid numerical underflow
nu0<-sample(x,1, prob=exp(lpnu0-max(lpnu0)) )


8.5.1 Analysis of math score data 


Let’s re-analyze the math score data with our hierarchical model for
school-specific means and variances. We’ll take the parameters in our prior
distributions to be the same as in the previous section, with $\alpha = 1$ and
$\{a = 1, b = 100\}$ for the prior distributions on $\nu_0$ and $\sigma_0^2$. After running a
Gibbs sampler for 5,000 iterations, the posterior distributions of $\{\mu, \tau^2, \nu_0, \sigma_0^2\}$
are approximated and plotted in Figure 8.11. Additionally, the posterior
distributions of $\mu$ and $\tau^2$ under the hierarchical model with constant group variance
are shown in gray lines for comparison.
Fig. 8.11. Posterior distributions of between-group heterogeneity parameters.


The hierarchical model of Section 8.3, in which all within-group variances
are forced to be equal, is equivalent to a value of $\nu_0 = \infty$ in this hierarchical
model. In contrast, a low value of $\nu_0 = 1$ would indicate that the variances are
quite unequal, and little information about variances should be shared across
groups. Our hierarchical analysis indicates that neither of these extremes is
appropriate, as the posterior distribution of $\nu_0$ is concentrated around a
moderate value of 14 or 15. This estimated distribution of $\sigma_1^2, \ldots, \sigma_m^2$ is used to
shrink extreme sample variances towards an across-group center, as is shown
in Figure 8.12. The relationship between sample size and the amount of
variance shrinkage is shown in the second panel of the plot. As with estimation for
the group means, the larger amounts of shrinkage generally occur for groups
with smaller sample sizes.
Fig. 8.12. Shrinkage as a function of sample size.


8.6 Discussion and further references 


Lindley and Smith (1972) laid the foundation for Bayesian hierarchical mod- 
eling, although the idea of shrinking the estimates of the individual group 
means towards an across-group mean goes back at least to Kelley (1927) in 
the context of educational testing. In the statistical literature, the benefits 
of this type of estimation are referred to as the “Stein effect” (Stein, 1956, 
1981). Estimators of this type generally take the form $\hat\theta_j = w_j \bar y_j + (1 - w_j)\bar y$,
where $\bar y$ is an average over all groups and the $w_j$'s depend on $n_j$, $\sigma^2$ and
$\tau^2$. So-called empirical Bayes procedures obtain estimates of $\sigma^2$ and $\tau^2$ from
the data, then plug these values into the formula for $\hat\theta_j$ (Efron and Morris,
1973; Casella, 1985). Such procedures often yield estimates of the $\theta_j$'s that
are nearly equivalent to those from Bayesian procedures, but ignore
uncertainty in the values of $\sigma^2$ and $\tau^2$. For a detailed treatment of empirical Bayes
methods, see Carlin and Louis (1996).
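
To make the form of such estimators concrete, here is a small R sketch. It assumes the weight $w_j = (n_j/\sigma^2)/(n_j/\sigma^2 + 1/\tau^2)$, which is the weight implied by the conditional mean of $\theta_j$ in the normal hierarchical model; the function name and all numbers are hypothetical.

```r
# Shrinkage estimate of a group mean: a weighted average of the group
# sample mean and the across-group average, with weight
# w_j = (n_j/sigma2) / (n_j/sigma2 + 1/tau2)  (an assumed form of w_j).
shrink.mean <- function(ybar.j, n.j, ybar.all, sigma2, tau2) {
  w <- (n.j / sigma2) / (n.j / sigma2 + 1 / tau2)
  w * ybar.j + (1 - w) * ybar.all
}

# Two groups with the same sample mean but different sample sizes
est.small <- shrink.mean(ybar.j = 38, n.j = 5,  ybar.all = 48, sigma2 = 85, tau2 = 25)
est.large <- shrink.mean(ybar.j = 38, n.j = 30, ybar.all = 48, sigma2 = 85, tau2 = 25)
```

The group with fewer observations receives a smaller weight on its own sample mean, so its estimate is pulled further towards the across-group average.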

Terminology for hierarchical models is inconsistent in the literature. For 
the simple hierarchical model $y_{i,j} = \theta_j + \epsilon_{i,j}$, $\theta_j = \mu + \gamma_j$, the $\theta_j$'s (or $\gamma_j$'s) may
be referred to as either “fixed effects” or “random effects,” usually depending
on how they are estimated. The distribution of the $\theta_j$'s is unfortunately often
referred to as a prior distribution, which mischaracterizes Bayesian inference 
and renders the distinction between prior information and population distri- 
bution somewhat meaningless. For a discussion of some of this confusion, see 
Gelman and Hill (2007, pp. 245-246). 

Hierarchical modeling of variances is not common, perhaps due to the 
mean parameters being of greater interest. However, erroneously assuming a 
common within-group variance could lead to improper pooling of information, 
or to the shrinkage of group-specific parameters by inappropriate amounts. 


9 


Linear regression 


Linear regression modeling is an extremely powerful data analysis tool, useful 
for a variety of inferential tasks such as prediction, parameter estimation and 
data description. In this section we give a very brief introduction to the lin- 
ear regression model and the corresponding Bayesian approach to estimation. 
Additionally, we discuss the relationship between Bayesian and ordinary least 
squares regression estimates. 

One difficult aspect of regression modeling is deciding which explanatory 
variables to include in a model. This variable selection problem has a natural 
Bayesian solution: Any collection of models having different sets of regressors 
can be compared via their Bayes factors. When the number of possible regres- 
sors is small, this allows us to assign a posterior probability to each regression 
model. When the number of regressors is large, the space of models can be 
explored with a Gibbs sampling algorithm. 


9.1 The linear regression model 


Regression modeling is concerned with describing how the sampling distribution
of one random variable $Y$ varies with another variable or set of variables
$x = (x_1, \ldots, x_p)$. Specifically, a regression model postulates a form for $p(y \mid x)$,
the conditional distribution of $Y$ given $x$. Estimation of $p(y \mid x)$ is made using
data $y_1, \ldots, y_n$ that are gathered under a variety of conditions $x_1, \ldots, x_n$.


Example: Oxygen uptake (from Kuehl (2000)) 


Twelve healthy men who did not exercise regularly were recruited to take part 
in a study of the effects of two different exercise regimens on oxygen uptake. Six
of the twelve men were randomly assigned to a 12-week flat-terrain running 
program, and the remaining six were assigned to a 12-week step aerobics 
program. The maximum oxygen uptake of each subject was measured (in 
liters per minute) while running on an inclined treadmill, both before and 
after the 12-week program. Of interest is how a subject’s change in maximal
oxygen uptake may depend on which program they were assigned to. However, 
other factors, such as age, are expected to affect the change in maximal uptake 
as well. 
Fig. 9.1. Change in maximal oxygen uptake as a function of age and exercise
program. 


How might we estimate the conditional distribution of oxygen uptake for 
a given exercise program and age? One possibility would be to estimate a 
population mean and variance for each age and program combination. For 
example, we could estimate a mean and variance from the 22-year-olds in 
the study who were assigned the running program, and a separate mean and 
variance for the 22-year-olds assigned to the aerobics program. The data from 
this study, shown in Figure 9.1, indicate that such an approach is problematic. 
For example, there is only one 22-year-old assigned to the aerobics program, 
which is not enough data to provide information about a population variance. 
Furthermore, there are many age/program combinations for which there are 
no data. 

One solution to this problem is to assume that the conditional distribution
$p(y \mid x)$ changes smoothly as a function of $x$, so that data we have at one value
of $x$ can inform us about what might be going on at a different value. A linear
regression model is a particular type of smoothly changing model for $p(y \mid x)$
that specifies that the conditional expectation $E[Y \mid x]$ has a form that is linear
in a set of parameters:

$$E[Y \mid x] = \beta_1 x_1 + \cdots + \beta_p x_p = \beta^T x.$$

It is important to note that such a model allows a great deal of freedom for
$x_1, \ldots, x_p$. For example, in the oxygen uptake example we could let $x_1$ = age
and $x_2$ = age$^2$ if we thought there might be a quadratic relationship between
maximal uptake and age. However, Figure 9.1 does not indicate any quadratic
relationships, and so a reasonable model for $p(y \mid x)$ could include two different
linear relationships between age and uptake, one for each group:


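$$Y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \epsilon_i, \text{ where} \qquad (9.1)$$

$x_{i,1} = 1$ for each subject $i$
$x_{i,2} = 0$ if subject $i$ is on the running program, 1 if on aerobic
$x_{i,3}$ = age of subject $i$
$x_{i,4} = x_{i,2} \times x_{i,3}$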


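Under this model the conditional expectations of $Y$ for the two different levels
of the group indicator $x_{i,2}$ are

$$E[Y \mid x] = \beta_1 + \beta_3 \times \text{age} \quad \text{if } x_2 = 0, \text{ and}$$
$$E[Y \mid x] = (\beta_1 + \beta_2) + (\beta_3 + \beta_4) \times \text{age} \quad \text{if } x_2 = 1.$$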


In other words, the model assumes that the relationship is linear in age for
both exercise groups, with the difference in intercepts given by $\beta_2$ and the
difference in slopes given by $\beta_4$. If we assumed that $\beta_2 = \beta_4 = 0$, then we
would have an identical line for both groups. If we assumed $\beta_4 = 0$ then
we would have a different line for each group but they would be parallel.
Allowing all coefficients to be non-zero gives us two unrelated lines. Some
different possibilities are depicted graphically in Figure 9.2.

We still have not specified anything about $p(y \mid x)$ beyond $E[Y \mid x]$. The normal
linear regression model specifies that, in addition to $E[Y \mid x]$ being linear,
the sampling variability around the mean is i.i.d. from a normal distribution:

$$\epsilon_1, \ldots, \epsilon_n \sim \text{i.i.d. normal}(0, \sigma^2)$$
$$Y_i = \beta^T x_i + \epsilon_i.$$

This model provides a complete specification of the joint probability density
of observed data $y_1, \ldots, y_n$ conditional upon $x_1, \ldots, x_n$ and values of $\beta$ and
$\sigma^2$:

$$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n, \beta, \sigma^2) \qquad (9.2)$$
$$= \prod_{i=1}^n p(y_i \mid x_i, \beta, \sigma^2)$$
$$= (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta^T x_i)^2 \right\}. \qquad (9.3)$$


Another way to write this joint probability density is in terms of the
multivariate normal distribution: Let $y$ be the $n$-dimensional column vector
$(y_1, \ldots, y_n)^T$, and let $X$ be the $n \times p$ matrix whose $i$th row is $x_i$. Then the
normal regression model is that

$$\{y \mid X, \beta, \sigma^2\} \sim \text{multivariate normal}(X\beta, \sigma^2 I),$$

where $I$ is the $n \times n$ identity matrix and

$$X\beta = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} = \begin{pmatrix} \beta_1 x_{1,1} + \cdots + \beta_p x_{1,p} \\ \vdots \\ \beta_1 x_{n,1} + \cdots + \beta_p x_{n,p} \end{pmatrix} = \begin{pmatrix} E[Y_1 \mid \beta, x_1] \\ \vdots \\ E[Y_n \mid \beta, x_n] \end{pmatrix}.$$

Fig. 9.2. Least squares regression lines for the oxygen uptake data, under four
different models.


The density (9.3) depends on $\beta$ through the residuals $(y_i - \beta^T x_i)$. Given
the observed data, the term in the exponent is maximized when the sum of
squared residuals, SSR$(\beta) = \sum_{i=1}^n (y_i - \beta^T x_i)^2$, is minimized. To find the value
of $\beta$ at which this minimum occurs it is helpful to rewrite SSR$(\beta)$ in matrix
notation:

$$\text{SSR}(\beta) = \sum_{i=1}^n (y_i - \beta^T x_i)^2 = (y - X\beta)^T (y - X\beta)$$
$$= y^T y - 2\beta^T X^T y + \beta^T X^T X \beta.$$

Recall from calculus that

1. a minimum of a function $g(z)$ occurs at a value $z$ such that $\frac{d}{dz} g(z) = 0$;
2. the derivative of $g(z) = az$ is $a$ and the derivative of $g(z) = bz^2$ is $2bz$.

These facts translate over to the multivariate case, and can be used to obtain
the minimizer of SSR$(\beta)$:

$$\frac{d}{d\beta} \text{SSR}(\beta) = \frac{d}{d\beta} \left( y^T y - 2\beta^T X^T y + \beta^T X^T X \beta \right) = -2X^T y + 2X^T X \beta, \text{ therefore}$$

$$\frac{d}{d\beta} \text{SSR}(\beta) = 0 \;\Leftrightarrow\; -2X^T y + 2X^T X \beta = 0$$
$$\Leftrightarrow\; X^T X \beta = X^T y$$
$$\Leftrightarrow\; \beta = (X^T X)^{-1} X^T y.$$

The value $\hat\beta_{ols} = (X^T X)^{-1} X^T y$ is called the “ordinary least squares” (OLS)
estimate of $\beta$, as it provides the value of $\beta$ that minimizes the sum of squared
residuals. This value is unique as long as the inverse $(X^T X)^{-1}$ exists. The
value $\hat\beta_{ols}$ also plays a role in Bayesian estimation, as we shall see in the next
section.


9.1.1 Least squares estimation for the oxygen uptake data 


Let’s find the least squares regression estimates for the model in Equation 9.1, 
and use the results to evaluate differences between the two exercise groups. 
The ages of the 12 subjects, along with their observed changes in maximal 
oxygen uptake, are 


$x_3$ = (23, 22, 22, 25, 27, 20, 31, 23, 27, 28, 22, 24)
$y$ = (−0.87, −10.74, −3.27, −1.97, 7.50, −7.25, 17.05, 4.96, 10.40,
11.05, 0.26, 2.51),


with the first six elements of each vector corresponding to the subjects in 
the running group and the latter six corresponding to subjects in the aerobics 
group. After constructing the $12 \times 4$ matrix $X$ out of the vectors $x_1, x_2, x_3, x_4$
defined as in (9.1), the matrices $X^T X$ and $X^T y$ can be computed:

$$X^T X = \begin{pmatrix} 12 & 6 & 294 & 155 \\ 6 & 6 & 155 & 155 \\ 294 & 155 & 7314 & 4063 \\ 155 & 155 & 4063 & 4063 \end{pmatrix}, \qquad X^T y = \begin{pmatrix} 29.63 \\ 46.23 \\ 978.81 \\ 1298.79 \end{pmatrix}.$$


Inverting the $X^T X$ matrix and multiplying the result by $X^T y$ gives the vector
$\hat\beta_{ols} = (-51.29, 13.11, 2.09, -0.32)^T$. This means that the estimated linear
relationship between uptake and age has an intercept and slope of −51.29 and 2.09
for the running group, and −51.29 + 13.11 = −38.18 and 2.09 − 0.32 = 1.77
for the aerobics group. These two lines are plotted in the fourth panel of
Figure 9.2. An unbiased estimate of $\sigma^2$ can be obtained from SSR$(\hat\beta_{ols})/(n - p)$,
which for these data gives $\hat\sigma^2_{ols} = 8.54$. The sampling variance of the vector
$\hat\beta_{ols}$ can be shown to be equal to $(X^T X)^{-1}\sigma^2$. We do not know the true value
of $\sigma^2$, but the value of $\hat\sigma^2_{ols}$ can be plugged in to give standard errors for the
components of $\hat\beta_{ols}$. These are 12.25, 15.76, 0.53 and 0.65 for the four
regression coefficients in order. Comparing the values of $\hat\beta_{ols}$ to their standard errors
suggests that the evidence for differences between the two exercise regimens is
not very strong. We will explore this further in the next few sections.
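
The calculations in this subsection can be sketched in R as follows. The data are those listed above; the variable names are chosen for this illustration and are not from the text.

```r
# Oxygen uptake data as listed in the text
x3 <- c(23, 22, 22, 25, 27, 20, 31, 23, 27, 28, 22, 24)      # age
y  <- c(-0.87, -10.74, -3.27, -1.97, 7.50, -7.25,
        17.05, 4.96, 10.40, 11.05, 0.26, 2.51)               # change in uptake
x1 <- rep(1, 12)                  # intercept
x2 <- rep(c(0, 1), each = 6)      # 0 = running, 1 = aerobics
x4 <- x2 * x3                     # group-by-age interaction
X  <- cbind(x1, x2, x3, x4)

n <- nrow(X) ; p <- ncol(X)
XtX <- t(X) %*% X
Xty <- t(X) %*% y

beta.ols <- solve(XtX, Xty)                        # solves XtX beta = Xty
s2.ols   <- sum((y - X %*% beta.ols)^2) / (n - p)  # unbiased estimate of sigma^2
se.ols   <- sqrt(diag(solve(XtX)) * s2.ols)        # plug-in standard errors
```

Calling `solve(XtX, Xty)` solves the normal equations directly, which avoids forming the inverse explicitly when only the coefficient vector is needed.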


9.2 Bayesian estimation for a regression model 


We begin with a simple semiconjugate prior distribution for $\beta$ and $\sigma^2$ to be
used when there is information available about the parameters. In situations 
where prior information is unavailable or difficult to quantify, an alternative 
“default” class of prior distributions is given. 


9.2.1 A semiconjugate prior distribution 


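The sampling density of the data (Equation 9.3), as a function of $\beta$, is

$$p(y \mid X, \beta, \sigma^2) \propto \exp\left\{ -\frac{1}{2\sigma^2} \text{SSR}(\beta) \right\}$$
$$= \exp\left\{ -\frac{1}{2\sigma^2} \left[ y^T y - 2\beta^T X^T y + \beta^T X^T X \beta \right] \right\}.$$

The role that $\beta$ plays in the exponent looks very similar to that played by
$y$, and the distribution of $y$ is multivariate normal. This suggests that a
multivariate normal prior distribution for $\beta$ is conjugate. Let’s see if this is
correct: If $\beta \sim$ multivariate normal$(\beta_0, \Sigma_0)$, then

$$p(\beta \mid y, X, \sigma^2) \propto p(y \mid X, \beta, \sigma^2) \times p(\beta)$$
$$\propto \exp\left\{ -\tfrac{1}{2} \left( -2\beta^T X^T y/\sigma^2 + \beta^T X^T X \beta/\sigma^2 \right) - \tfrac{1}{2} \left( -2\beta^T \Sigma_0^{-1} \beta_0 + \beta^T \Sigma_0^{-1} \beta \right) \right\}$$
$$= \exp\left\{ \beta^T \left( \Sigma_0^{-1} \beta_0 + X^T y/\sigma^2 \right) - \tfrac{1}{2} \beta^T \left( \Sigma_0^{-1} + X^T X/\sigma^2 \right) \beta \right\}.$$

Referring back to Chapter 7, we recognize this as being proportional to a
multivariate normal density, with

$$\text{Var}[\beta \mid y, X, \sigma^2] = (\Sigma_0^{-1} + X^T X/\sigma^2)^{-1} \qquad (9.4)$$
$$E[\beta \mid y, X, \sigma^2] = (\Sigma_0^{-1} + X^T X/\sigma^2)^{-1} (\Sigma_0^{-1} \beta_0 + X^T y/\sigma^2). \qquad (9.5)$$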


As usual, we can gain some understanding of these formulae by considering
some limiting cases. If the elements of the prior precision matrix $\Sigma_0^{-1}$ are small
in magnitude, then the conditional expectation $E[\beta \mid y, X, \sigma^2]$ is approximately
equal to $(X^T X)^{-1} X^T y$, the least squares estimate. On the other hand, if the
measurement precision is very small ($\sigma^2$ is very large), then the expectation
is approximately $\beta_0$, the prior expectation.

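As in most normal sampling problems, the semiconjugate prior distribution
for $\sigma^2$ is an inverse-gamma distribution. Letting $\gamma = 1/\sigma^2$ be the measurement
precision, if $\gamma \sim$ gamma$(\nu_0/2, \nu_0\sigma_0^2/2)$, then

$$p(\gamma \mid y, X, \beta) \propto p(\gamma)\, p(y \mid X, \beta, \gamma)$$
$$\propto \left[ \gamma^{\nu_0/2 - 1} \exp(-\gamma \times \nu_0\sigma_0^2/2) \right] \times \left[ \gamma^{n/2} \exp(-\gamma \times \text{SSR}(\beta)/2) \right]$$
$$= \gamma^{(\nu_0 + n)/2 - 1} \exp(-\gamma [\nu_0\sigma_0^2 + \text{SSR}(\beta)]/2),$$

which we recognize as a gamma density, so that

$$\{\sigma^2 \mid y, X, \beta\} \sim \text{inverse-gamma}([\nu_0 + n]/2, [\nu_0\sigma_0^2 + \text{SSR}(\beta)]/2).$$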


Constructing a Gibbs sampler to approximate the joint posterior distribution
$p(\beta, \sigma^2 \mid y, X)$ is then straightforward: Given current values $\{\beta^{(s)}, \sigma^{2(s)}\}$, new
values can be generated by

1. updating $\beta$:
   a) compute $V = \text{Var}[\beta \mid y, X, \sigma^{2(s)}]$ and $m = E[\beta \mid y, X, \sigma^{2(s)}]$
   b) sample $\beta^{(s+1)} \sim$ multivariate normal$(m, V)$
2. updating $\sigma^2$:
   a) compute SSR$(\beta^{(s+1)})$
   b) sample $\sigma^{2(s+1)} \sim$ inverse-gamma$([\nu_0 + n]/2, [\nu_0\sigma_0^2 + \text{SSR}(\beta^{(s+1)})]/2)$.
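
The two steps above can be sketched in R as follows. This is one possible implementation rather than the book's own code: the function name `gibbs.reg`, the use of a Cholesky factor to draw the multivariate normal, and the simulated data are all assumptions made for illustration.

```r
# Gibbs sampler for normal linear regression under the semiconjugate prior
# beta ~ multivariate normal(beta0, Sigma0), 1/sigma^2 ~ gamma(nu0/2, nu0*s20/2).
gibbs.reg <- function(y, X, beta0, Sigma0, nu0, s20, S = 1000) {
  n <- nrow(X) ; p <- ncol(X)
  iSigma0 <- solve(Sigma0)
  XtX <- t(X) %*% X ; Xty <- t(X) %*% y
  s2 <- s20                                   # starting value for sigma^2
  BETA <- matrix(NA, S, p) ; S2 <- numeric(S)
  for (s in 1:S) {
    # 1. update beta from its multivariate normal full conditional
    V <- solve(iSigma0 + XtX / s2)                  # Equation 9.4
    m <- V %*% (iSigma0 %*% beta0 + Xty / s2)       # Equation 9.5
    beta <- m + t(chol(V)) %*% rnorm(p)
    # 2. update sigma^2 from its inverse-gamma full conditional
    ssr <- sum((y - X %*% beta)^2)
    s2 <- 1 / rgamma(1, (nu0 + n) / 2, (nu0 * s20 + ssr) / 2)
    BETA[s, ] <- beta ; S2[s] <- s2
  }
  list(beta = BETA, s2 = S2)
}

# Simulated data (hypothetical) and a diffuse prior
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n))
y <- X %*% c(1, 2) + rnorm(n)
fit <- gibbs.reg(y, X, beta0 = c(0, 0), Sigma0 = diag(1000, 2), nu0 = 1, s20 = 1)
beta.ols <- solve(t(X) %*% X, t(X) %*% y)
```

With a prior this diffuse, the posterior mean of $\beta$ estimated by `colMeans(fit$beta)` should be close to `beta.ols`, in line with the limiting case discussed above.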


9.2.2 Default and weakly informative prior distributions 


A Bayesian analysis of a regression model requires specification of the prior
parameters $(\beta_0, \Sigma_0)$ and $(\nu_0, \sigma_0^2)$. Finding values of these parameters that
represent actual prior information can be difficult. In the oxygen uptake
experiment, for example, a quick scan of a few articles on exercise physiology
indicates that males in their 20s have an oxygen uptake of around 150 liters
per minute with a standard deviation of 15. If we take 150 ± 2 × 15 = (120, 180)
as a prior expected range of the oxygen uptake distribution, then the changes
in oxygen uptake should lie within (−60, 60) with high probability. Considering
our subjects in the running group, this means that the line $\beta_1 + \beta_3 x$ should
produce values between −60 and 60 for all values of $x$ between 20 and 30. A
little algebra then shows that we need a prior distribution on $(\beta_1, \beta_3)$ such
that $-300 < \beta_1 < 300$ and $-12 < \beta_3 < 12$ with high probability. This could
be done by taking $\Sigma_{0,1,1} = 150^2$ and $\Sigma_{0,2,2} = 6^2$, for example. However, we
would still need to specify the prior variances of the other parameters, as well
as the six prior correlations between the parameters. The task of constructing
an informative prior distribution only gets harder as the number of regressors
increases, as the number of prior correlation parameters is $\binom{p}{2}$, which increases
quadratically in $p$.

Sometimes an analysis must be done in the absence of precise prior infor- 
mation, or information that is easily converted into the parameters of a con- 
jugate prior distribution. In these situations one could stick to least squares 
estimation, with the drawback that probability statements about $\beta$ would be
unavailable. Alternatively, it is sometimes possible to justify a prior distri- 
bution with other criteria. One idea is that, if the prior distribution is not 
going to represent real prior information about the parameters, then it should 
be as minimally informative as possible. The resulting posterior distribution 
would then represent the posterior information of someone who began with 
little knowledge of the population being studied. To some, such an analysis 
would give a “more objective” result than using an informative prior distri- 
bution, especially one that did not actually represent real prior information. 
One type of weakly informative prior is the unit information prior (Kass and
Wasserman, 1995). A unit information prior is one that contains the same
amount of information as would be contained in only a single observation.
For example, the precision of $\hat\beta_{ols}$ is its inverse variance, or $(X^T X)/\sigma^2$.
Since this can be viewed as the amount of information in $n$ observations, the
amount of information in one observation should be “one $n$th” as much, i.e.
$(X^T X)/(n\sigma^2)$. The unit information prior thus sets $\Sigma_0^{-1} = (X^T X)/(n\sigma^2)$.
Kass and Wasserman (1995) further suggest setting $\beta_0 = \hat\beta_{ols}$, thus centering
the prior distribution of $\beta$ around the OLS estimate. Such a distribution cannot
be strictly considered a real prior distribution, as it requires knowledge of
$y$ to be constructed. However, it only uses a small amount of the information
in $y$, and can be loosely thought of as the prior distribution of a person with
unbiased but weak prior information. In a similar way, the prior distribution
of $\sigma^2$ can be weakly centered around $\hat\sigma^2_{ols}$ by taking $\nu_0 = 1$ and $\sigma_0^2 = \hat\sigma^2_{ols}$.
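
A sketch of how the unit information prior parameters might be computed in R is given below. It assumes that the unknown $\sigma^2$ in $\Sigma_0^{-1} = (X^T X)/(n\sigma^2)$ is replaced by the plug-in value $\hat\sigma^2_{ols}$; the function name and the simulated data are hypothetical.

```r
# Construct unit information prior parameters from the data (y, X).
# Assumption: sigma^2 in Sigma0 is replaced by the OLS estimate s2.ols.
unit.info.prior <- function(y, X) {
  n <- nrow(X) ; p <- ncol(X)
  XtX      <- t(X) %*% X
  beta.ols <- solve(XtX, t(X) %*% y)
  s2.ols   <- sum((y - X %*% beta.ols)^2) / (n - p)
  list(beta0  = c(beta.ols),                # prior mean: the OLS estimate
       Sigma0 = n * s2.ols * solve(XtX),    # inverse of XtX / (n * s2.ols)
       nu0    = 1,
       s20    = s2.ols)
}

# Usage on simulated data (made up for illustration)
set.seed(1)
X <- cbind(1, rnorm(20)) ; y <- X %*% c(1, 2) + rnorm(20)
prior <- unit.info.prior(y, X)
```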

Another principle for constructing a prior distribution for 8 is based on 
the idea that the parameter estimation should be invariant to changes in the 
scale of the regressors. For example, suppose someone were to analyze the 
oxygen uptake data using $\tilde x_{i,3}$ = age in months, instead of $x_{i,3}$ = age in years.
It makes sense that our posterior distribution for $12 \times \tilde\beta_3$ in the model with
$\tilde x_{i,3}$ should be the same as the posterior distribution for $\beta_3$ based on the model
with $x_{i,3}$. This condition requires, for example, that the posterior expected
change in $y$ for a one-year change in age is the same, whether age is recorded in
terms of months or years. More generally, suppose $X$ is a given set of regressors
and $\tilde X = XH$ for some $p \times p$ matrix $H$. If we obtain the posterior distribution
of $\beta$ from $y$ and $X$, and the posterior distribution of $\tilde\beta$ from $y$ and $\tilde X$, then,
according to this principle of invariance, the posterior distributions of $\beta$ and
$H\tilde\beta$ should be the same. Some linear algebra shows that this condition will
be met if $\beta_0 = 0$ and $\Sigma_0 = k(X^T X)^{-1}$ for any positive value $k$. A popular
specification of $k$ is to relate it to the error variance $\sigma^2$, so that $k = g\sigma^2$ for
some positive value $g$. These choices of prior parameters result in a version
of the so-called “g-prior” (Zellner, 1986), a widely studied and used prior
distribution for regression parameters (Zellner’s original g-prior allowed $\beta_0$ to
be non-zero). Under this invariant g-prior the conditional distribution of $\beta$
given $(y, X, \sigma^2)$ is still multivariate normal, but Equations 9.4 and 9.5 reduce
to the following simpler forms:

$$\text{Var}[\beta \mid y, X, \sigma^2] = \left[ X^T X/(g\sigma^2) + X^T X/\sigma^2 \right]^{-1} = \frac{g}{g+1} \sigma^2 (X^T X)^{-1} \qquad (9.6)$$

$$E[\beta \mid y, X, \sigma^2] = \left[ X^T X/(g\sigma^2) + X^T X/\sigma^2 \right]^{-1} X^T y/\sigma^2 = \frac{g}{g+1} (X^T X)^{-1} X^T y. \qquad (9.7)$$


Parameter estimation under the g-prior is simplified as well: It turns out that,
under this prior distribution, $p(\sigma^2 \mid y, X)$ is an inverse-gamma distribution,
which means that we can directly sample $(\sigma^2, \beta)$ from their posterior
distribution by first sampling from $p(\sigma^2 \mid y, X)$ and then from $p(\beta \mid \sigma^2, y, X)$.


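Derivation of $p(\sigma^2 \mid y, X)$


The marginal posterior density of $\sigma^2$ is proportional to $p(\sigma^2) \times p(y \mid X, \sigma^2)$.
Using the rules of marginal probability, the latter term in this product can be
expressed as the following integral:

$$p(y \mid X, \sigma^2) = \int p(y \mid X, \beta, \sigma^2)\, p(\beta \mid X, \sigma^2)\, d\beta.$$

Writing out the two densities inside the integral, we have

$$p(y \mid X, \beta, \sigma^2)\, p(\beta \mid X, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left[ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \right] \times \left| 2\pi g\sigma^2 (X^T X)^{-1} \right|^{-1/2} \exp\left[ -\frac{1}{2g\sigma^2} \beta^T X^T X \beta \right].$$

Combining the terms in the exponents gives

$$-\frac{1}{2\sigma^2} \left[ (y - X\beta)^T (y - X\beta) + \beta^T X^T X \beta/g \right]$$
$$= -\frac{1}{2\sigma^2} \left[ y^T y - 2y^T X\beta + \beta^T X^T X \beta (1 + 1/g) \right]$$
$$= -\frac{1}{2\sigma^2} y^T y - \frac{1}{2} (\beta - m)^T V^{-1} (\beta - m) + \frac{1}{2} m^T V^{-1} m,$$

where

$$V = \frac{g}{g+1} \sigma^2 (X^T X)^{-1} \quad \text{and} \quad m = \frac{g}{g+1} (X^T X)^{-1} X^T y.$$

This means that we can write $p(y \mid X, \beta, \sigma^2)\, p(\beta \mid X, \sigma^2)$ as

$$\left[ (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} y^T y \right) \right] \times \left[ (1 + g)^{-p/2} \exp\left( \frac{1}{2} m^T V^{-1} m \right) \right] \times \left[ |2\pi V|^{-1/2} \exp\left( -\frac{1}{2} (\beta - m)^T V^{-1} (\beta - m) \right) \right].$$

The third term in the product is the only term that depends on $\beta$. This term is
exactly the multivariate normal density with mean $m$ and variance $V$, which
as a probability density must integrate to 1. This means that if we integrate
the whole thing with respect to $\beta$ we are left with only the first two terms:

$$p(y \mid X, \sigma^2) = \int p(y \mid X, \beta, \sigma^2)\, p(\beta \mid X, \sigma^2)\, d\beta$$
$$= \left[ (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} y^T y \right) \right] \times \left[ (1 + g)^{-p/2} \exp\left( \frac{1}{2} m^T V^{-1} m \right) \right],$$

which, after combining the terms in the exponents, is

$$p(y \mid X, \sigma^2) = (2\pi)^{-n/2} (1 + g)^{-p/2} (\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \text{SSR}_g \right),$$

where SSR$_g$ is defined as

$$\text{SSR}_g = y^T y - \sigma^2 m^T V^{-1} m = y^T \left( I - \frac{g}{g+1} X (X^T X)^{-1} X^T \right) y.$$

Because $V^{-1}$ contains a factor of $1/\sigma^2$, the quantity $\sigma^2 m^T V^{-1} m$ does not depend
on $\sigma^2$. The term SSR$_g$ decreases to SSR$_{ols} = \sum_i (y_i - \hat\beta_{ols}^T x_i)^2$ as $g \to \infty$. The effect
of $g$ is that it shrinks down the magnitude of the regression coefficients and
can prevent overfitting of the data.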
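
The behavior of SSR$_g$ as $g$ grows can be checked numerically. The following R sketch uses simulated data and a helper function `ssr.g`, both of which are hypothetical and introduced only for this illustration.

```r
# SSR_g = y^T ( I - g/(g+1) X (X^T X)^{-1} X^T ) y
ssr.g <- function(y, X, g) {
  H <- X %*% solve(t(X) %*% X) %*% t(X)    # projection ("hat") matrix
  c(t(y) %*% (diag(length(y)) - (g / (g + 1)) * H) %*% y)
}

# Simulated data (made up for illustration)
set.seed(1)
X <- cbind(1, rnorm(30)) ; y <- c(X %*% c(1, 2) + rnorm(30))

beta.ols <- solve(t(X) %*% X, t(X) %*% y)
ssr.ols  <- sum((y - X %*% beta.ols)^2)

# SSR_g for increasing values of g
ssr.seq  <- sapply(c(1, 10, 100, 1e6), function(g) ssr.g(y, X, g))
```

The values in `ssr.seq` decrease as $g$ increases and approach `ssr.ols` from above.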

The last step in identifying $p(\sigma^2 \mid y, X)$ is to multiply $p(y \mid X, \sigma^2)$ by the
prior distribution. Letting $\gamma = 1/\sigma^2 \sim$ gamma$(\nu_0/2, \nu_0\sigma_0^2/2)$, we have

$$p(\gamma \mid y, X) \propto p(\gamma)\, p(y \mid X, \gamma)$$
$$\propto \left[ \gamma^{\nu_0/2 - 1} \exp(-\gamma \times \nu_0\sigma_0^2/2) \right] \times \left[ \gamma^{n/2} \exp(-\gamma \times \text{SSR}_g/2) \right]$$
$$= \gamma^{(\nu_0 + n)/2 - 1} \exp\left[ -\gamma \times (\nu_0\sigma_0^2 + \text{SSR}_g)/2 \right]$$
$$\propto \text{dgamma}(\gamma, [\nu_0 + n]/2, [\nu_0\sigma_0^2 + \text{SSR}_g]/2),$$

and so $\{\sigma^2 \mid y, X\} \sim$ inverse-gamma$([\nu_0 + n]/2, [\nu_0\sigma_0^2 + \text{SSR}_g]/2)$. These
calculations, along with Equations 9.6 and 9.7, show that under this prior
distribution, $p(\sigma^2 \mid y, X)$ and $p(\beta \mid y, X, \sigma^2)$ are inverse-gamma and multivariate
normal distributions respectively. Since we can sample from both of these
distributions, samples from the joint posterior distribution $p(\sigma^2, \beta \mid y, X)$ can be
made with Monte Carlo approximation, and Gibbs sampling is unnecessary.
A sample value of $(\sigma^2, \beta)$ from $p(\sigma^2, \beta \mid y, X)$ can be made as follows:

1. sample $1/\sigma^2 \sim$ gamma$([\nu_0 + n]/2, [\nu_0\sigma_0^2 + \text{SSR}_g]/2)$;
2. sample $\beta \sim$ multivariate normal$\left( \frac{g}{g+1} \hat\beta_{ols}, \; \frac{g}{g+1} \sigma^2 [X^T X]^{-1} \right)$.


R-code to generate multiple independent Monte Carlo samples from the pos- 
terior distribution is below: 


data(chapter9) ; y<-yX.o2uptake[,1] ; X<-yX.o2uptake[,-1]
g<-length(y) ; nu0<-1 ; s20<-8.54
S<-1000

## data: y, X
## prior parameters: g, nu0, s20
## number of independent samples to generate: S

n<-dim(X)[1] ; p<-dim(X)[2]
Hg<- (g/(g+1)) * X%*%solve(t(X)%*%X)%*%t(X)
SSRg<- t(y)%*%( diag(1,nrow=n) - Hg ) %*%y

## sample sigma^2 from its marginal posterior distribution
s2<-1/rgamma(S, (nu0+n)/2, (nu0*s20+SSRg)/2 )

## sample beta given each sampled value of sigma^2
Vb<- g*solve(t(X)%*%X)/(g+1)
Eb<- Vb%*%t(X)%*%y

E<-matrix(rnorm(S*p,0,sqrt(s2)),S,p)
beta<-t( t(E%*%chol(Vb)) +c(Eb))

Bayesian analysis of the oxygen uptake data 


We will use the invariant g-prior with $g = n = 12$, $\nu_0 = 1$ and $\sigma_0^2 =
\hat\sigma^2_{ols} = 8.54$. The posterior mean of $\beta$ can be obtained directly from
Equation 9.7: Since $E[\beta \mid y, X, \sigma^2]$ does not depend on $\sigma^2$, we have $E[\beta \mid y, X] =
E[\beta \mid y, X, \sigma^2] = \frac{g}{g+1} \hat\beta_{ols}$, so the posterior means of the four regression
parameters are 12 × (−51.29, 13.11, 2.09, −0.32)/13 = (−47.35, 12.10, 1.93, −0.29).
Posterior standard deviations of these parameters are (14.41, 18.62, 0.62,
0.77), based on 1,000 independent Monte Carlo samples generated using the
R-code above. The marginal and joint posterior distributions for $(\beta_2, \beta_4)$ are
given in Figure 9.3, along with the (marginal) prior distributions for
comparison. The posterior distributions seem to suggest only weak evidence of
a difference between the two groups, as the 95% quantile-based posterior
intervals for $\beta_2$ and $\beta_4$ both contain zero. However, these parameters taken by
themselves do not quite tell the whole story. According to the model, the
average difference in $y$ between two people of the same age $x$ but in different
exercise programs is $\beta_2 + \beta_4 x$. Thus the posterior distribution for the effect of
the aerobics program over the running program is obtained via the posterior
distribution of $\beta_2 + \beta_4 x$ for each age $x$. Boxplots of these posterior
distributions are shown in Figure 9.4, which indicates reasonably strong evidence of
a difference at young ages, and less evidence at the older ones.
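
The posterior distribution of $\beta_2 + \beta_4 x$ at each age can be approximated from Monte Carlo samples of $\beta$. The R sketch below is self-contained: it re-enters the data listed in Section 9.1.1 instead of loading them, repeats the g-prior sampling steps, and uses 5,000 samples and a fixed seed, which are choices made for this illustration.

```r
# Oxygen uptake data and design matrix, as in Section 9.1.1
x3 <- c(23, 22, 22, 25, 27, 20, 31, 23, 27, 28, 22, 24)
y  <- c(-0.87, -10.74, -3.27, -1.97, 7.50, -7.25,
        17.05, 4.96, 10.40, 11.05, 0.26, 2.51)
x2 <- rep(c(0, 1), each = 6)
X  <- cbind(1, x2, x3, x2 * x3)
n <- nrow(X) ; p <- ncol(X)
g <- n ; nu0 <- 1 ; s20 <- 8.54 ; S <- 5000

# Monte Carlo samples from p(sigma^2, beta | y, X) under the g-prior
set.seed(1)
iXtX <- solve(t(X) %*% X)
Hg   <- (g / (g + 1)) * X %*% iXtX %*% t(X)
SSRg <- c(t(y) %*% (diag(n) - Hg) %*% y)
s2   <- 1 / rgamma(S, (nu0 + n) / 2, (nu0 * s20 + SSRg) / 2)
Vb   <- g * iXtX / (g + 1)
Eb   <- Vb %*% t(X) %*% y
E    <- matrix(rnorm(S * p, 0, sqrt(s2)), S, p)
beta <- t(t(E %*% chol(Vb)) + c(Eb))

# Posterior samples of beta2 + beta4 * age for each age from 20 to 30
ages   <- 20:30
effect <- sapply(ages, function(a) beta[, 2] + beta[, 4] * a)   # S x 11 matrix
ci     <- apply(effect, 2, quantile, probs = c(0.025, 0.5, 0.975))
colnames(ci) <- ages
```

Each column of `ci` holds the 2.5%, 50% and 97.5% posterior quantiles of the aerobics-minus-running effect at one age; `boxplot(effect, names = ages)` would give a display in the spirit of Figure 9.4.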
Fig. 9.3. Posterior distributions for $\beta_2$ and $\beta_4$, with marginal prior distributions in
the first two plots for comparison.


Fig. 9.4. Ninety-five percent confidence intervals for the difference in expected
change scores between aerobics subjects and running subjects. (Horizontal axis:
age, from 20 to 30.)


9.3 Model selection 


Often in regression analysis we are faced with a large number of possible re- 
gressor variables, even though we suspect that a majority of the regressors 
have no true relationship to the variable Y. In these situations, including all 
of the possible variables in a regression model can lead to poor statistical 
performance. Standard statistical advice is that we should include in our re- 
gression model only those variables for which there is substantial evidence 
of an association with y. Doing so not only produces simpler, more aesthet- 
ically pleasing data analyses, but also generally provides models with better 
statistical properties in terms of prediction and estimation. 
Example: Diabetes


Baseline data for ten variables $x_1, \ldots, x_{10}$ on a group of 442 diabetes patients
were gathered, as well as a measure $y$ of disease progression taken one year
after the baseline measurements. From these data we hope to make a predictive
model for $y$ based on the baseline measurements. While a regression model
with ten variables would not be overwhelmingly complex, it is suspected that
the relationship between $y$ and the $x_j$'s may not be linear, and that including
second-order terms like $x_j^2$ and $x_j x_k$ in the regression model might aid
in prediction. The regressors therefore include ten main effects $x_1, \ldots, x_{10}$,
$\binom{10}{2} = 45$ interactions of the form $x_j x_k$ and nine quadratic terms $x_j^2$ (one
of the regressors, $x_2$ = sex, is binary, so $x_2 = x_2^2$, making it unnecessary to
include $x_2^2$). This gives a total of $p = 64$ regressors. To help with the
interpretation of the parameters and to put the regressors on a common scale, all
of the variables have been centered and scaled so that $y$ and the columns of
$X$ all have mean zero and variance one.

In this section we will build predictive regression models for $y$ based on
the 64 regressor variables. To evaluate the models, we will randomly divide
the 442 diabetes subjects into 342 training samples and 100 test samples,
providing us with a training dataset $(y, X)$ and a test dataset
$(y_{test}, X_{test})$. We will fit the regression model using the training
data and then use the estimated regression coefficients to generate
$\hat{y}_{test} = X_{test}\hat{\beta}$. The performance of the predictive model
can then be evaluated by comparing $\hat{y}_{test}$ to $y_{test}$. Let's begin
by building a predictive model with ordinary least squares regression with all
64 variables. The first panel of Figure 9.5 plots the true values of the 100
test samples $y_{test}$ versus their predicted values
$\hat{y}_{test} = X_{test}\hat{\beta}_{ols}$, where $\hat{\beta}_{ols}$ was
estimated using the 342 training samples. While there is clearly a positive
relationship between the true values and the predictions, there is quite a bit
of error: The average squared predictive error is
$\frac{1}{100}\sum_i (y_{test,i} - \hat{y}_{test,i})^2 = 0.67$, whereas if we
just predicted each test case to be zero, our predictive error would be
$\frac{1}{100}\sum_i y_{test,i}^2 = 0.97$. 
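
The train/test evaluation described above can be sketched in a few lines of R.
The diabetes data are not reproduced here, so the sketch below uses simulated
data of the same dimensions; the data-generating values and all object names
(`X.all`, `beta.true`, `i.te` and so on) are assumptions made for illustration,
and the printed errors will not match the values 0.67 and 0.97 reported above.

```r
# Simulated stand-in for the diabetes data (assumption: not the real dataset)
set.seed(1)
n <- 442; p <- 64
X.all <- scale(matrix(rnorm(n * p), n, p))
beta.true <- c(rep(0.3, 6), rep(0, p - 6))      # only six regressors matter
y.all <- as.vector(scale(X.all %*% beta.true + rnorm(n)))

# Random split into 342 training cases and 100 test cases
i.te <- sample(1:n, 100)
y <- y.all[-i.te];   X <- X.all[-i.te, ]
y.te <- y.all[i.te]; X.te <- X.all[i.te, ]

# OLS fit on the training data, predictions for the test data
beta.ols <- solve(t(X) %*% X, t(X) %*% y)
y.te.ols <- as.vector(X.te %*% beta.ols)

mean((y.te - y.te.ols)^2)   # average squared predictive error of OLS
mean(y.te^2)                # error from predicting zero for every test case
```
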


[Figure 9.5: panels not reproduced. Recovered axis labels: $y_{test}$; regressor index.]


Fig. 9.5. Predicted values and regression coefficients for the diabetes data. 

The second panel of Figure 9.5 shows the estimated values of each of the 64
regression coefficients. Of note is that the majority of coefficients are
estimated to be quite small. Perhaps our predictions could be improved by
removing from the regression model those variables that show little evidence
of being non-zero. By doing so, we hope to remove from the predictive model any
regressors that have spurious associations to $Y$ (i.e. those associations
specific only to the training data), leaving only those regressors that would
have associations for any group of subjects (i.e. both the training and test
data). One standard way to assess the evidence that the true value of a
regression coefficient $\beta_j$ is not zero is with a t-statistic, which is
obtained by dividing the OLS estimate $\hat{\beta}_j$ by its standard error, so
$t_j = \hat{\beta}_j / [\hat{\sigma}^2 (X^TX)^{-1}_{j,j}]^{1/2}$. We might then
consider removing from the model those regressor variables with small absolute
values of $t_j$. For example, consider the following procedure: 


1. Obtain the estimator $\hat{\beta}_{ols} = (X^TX)^{-1}X^Ty$ and its t-statistics. 
2. If there are any regressors $j$ such that $|t_j| < t_{cutoff}$, 
   a) find the regressor $j_{min}$ having the smallest value of $|t_j|$ and remove 
      column $j_{min}$ from $X$. 
   b) return to step 1. 
3. If $|t_j| > t_{cutoff}$ for all variables $j$ remaining in the model, then stop. 


Such procedures, in which a potentially large set of regressors is reduced to
a smaller set, are called model selection procedures. The procedure defined in
steps 1, 2 and 3 above describes a type of backwards elimination procedure,
in which all regressors are initially included but then are iteratively removed
until the remaining regressors satisfy some criterion. A standard choice for
$t_{cutoff}$ is an upper quantile of a t or standard normal distribution. If we
apply the above procedure to the diabetes data with $t_{cutoff} = 1.65$
(corresponding roughly to a p-value of 0.10), then 44 of the 64 variables are
eliminated, leaving 20 variables in the regression model. The third plot of
Figure 9.5 shows $y_{test}$ versus predicted values based on the reduced-model
regression coefficients. The plot indicates that the predicted values from this
model are more accurate than those from the full model, and indeed the average
squared predictive error is
$\frac{1}{100}\sum_i (y_{test,i} - \hat{y}_{test,i})^2 = 0.53$. 
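
Steps 1 to 3 of the backwards elimination procedure can be written as a short
R function. This is a minimal sketch rather than the code used to produce the
results above: the function name `bselect`, its argument `t.cut` and the
simulated data in the usage lines are all assumptions introduced here.

```r
# Backwards elimination by t-statistics (sketch; bselect and t.cut are
# names chosen for this illustration)
bselect <- function(y, X, t.cut = 1.65) {
  keep <- 1:ncol(X)                        # regressors still in the model
  repeat {
    if (length(keep) == 0) break
    Xk <- X[, keep, drop = FALSE]
    fit <- lm(y ~ -1 + Xk)                 # step 1: OLS fit, no intercept
    tstat <- abs(coef(summary(fit))[, 3])  # |t_j| for the remaining regressors
    if (all(tstat >= t.cut)) break         # step 3: every |t_j| passes, stop
    keep <- keep[-which.min(tstat)]        # step 2a: drop the weakest regressor
  }
  keep
}

# Usage on simulated data in which only columns 1 and 2 matter
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 10), n, 10)
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)
bselect(y, X)
```

Applying the same function to a randomly permuted response, for example
`bselect(sample(y), X)`, gives a quick way to see how many regressors survive
when there is no true association.
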

Backwards selection is not without its drawbacks, however. What sort of
model would this procedure produce if there were no association between $Y$
and any of the regressors? We can evaluate this by creating a new data vector
$\tilde{y}$ by randomly permuting the values of $y$. Since in this case the
value of $x_i$ has no effect on $\tilde{y}_i$, the "true" association between
$\tilde{y}$ and the columns of $X$ is zero. However, the OLS regression model
will still pick up spurious associations: The first panel of Figure 9.6 shows
the t-statistics for one randomly generated permutation $\tilde{y}$ of $y$.
Initially, only one regressor has a t-statistic greater than 1.65, but as we
sequentially remove the columns of $X$ the estimated variance of the remaining
regressors decreases and their t-statistics increase in value. With
$t_{cutoff} = 1.65$, the procedure arrives at a regression model with 18
regressors, 17 of which have t-statistics greater than 2 in absolute value, and
four of which have statistics greater than 3. Even though $\tilde{y}$ was
generated without regard to $X$, the backwards selection procedure erroneously
suggests that many of the regressors do have an association. Such misleading
results are fairly common with backwards elimination and other sequential model
selection procedures (Berk, 1978). 

Fig. 9.6. t-statistics for the regression of $\tilde{y}$ on $X$, before and
after backwards elimination. 


9.3.1 Bayesian model comparison 


The Bayesian solution to the model selection problem is conceptually
straightforward: If we believe that many of the regression coefficients are
potentially equal to zero, then we simply come up with a prior distribution
that reflects this possibility. This can be accomplished by specifying that
each regression coefficient has some non-zero probability of being exactly
zero. A convenient way to represent this is to write the regression coefficient
for variable $j$ as $\beta_j = z_j \times b_j$, where $z_j \in \{0,1\}$ and
$b_j$ is some real number. With this parameterization, our regression equation
becomes 

$$ y_i = z_1 b_1 x_{i,1} + \cdots + z_p b_p x_{i,p} + \epsilon_i. $$

The $z_j$'s indicate which regression coefficients are non-zero. For example,
in the oxygen uptake problem, 

$$
\begin{aligned}
E[Y|x, b, z = (1,0,1,0)] &= b_1 x_1 + b_3 x_3 \\
 &= b_1 + b_3 \times \text{age} \\
E[Y|x, b, z = (1,1,0,0)] &= b_1 x_1 + b_2 x_2 \\
 &= b_1 + b_2 \times \text{group} \\
E[Y|x, b, z = (1,1,1,0)] &= b_1 x_1 + b_2 x_2 + b_3 x_3 \\
 &= b_1 + b_2 \times \text{group} + b_3 \times \text{age}.
\end{aligned}
$$


Each value of $z = (z_1, \ldots, z_p)$ corresponds to a different model, or more
specifically, a different collection of variables having non-zero regression
coefficients. For example, we say that the model with $z = (1,0,1,0)$ is a
linear regression model for $y$ as a function of age. The model with
$z = (1,1,1,0)$ is referred to as a regression model for $y$ as a function of
age, but with a group-specific intercept. With this parameterization, choosing
which variables to include in a regression model is equivalent to choosing
which $z_j$'s are 0 and which are 1. 

Bayesian model selection proceeds by obtaining a posterior distribution
for $z$. Of course, doing so requires a joint prior distribution on
$\{z, \beta, \sigma^2\}$. It turns out that a version of the g-prior described
in the previous section allows us to evaluate $p(y|X, z)$ for each possible
model $z$. Given a prior distribution $p(z)$ over models, this allows us to
compute a posterior probability for each regression model: 

$$ p(z|y, X) = \frac{p(z)\,p(y|X, z)}{\sum_{\tilde{z}} p(\tilde{z})\,p(y|X, \tilde{z})}. $$

Alternatively, we can compare the evidence for any two models with the
posterior odds: 

$$
\begin{aligned}
\text{odds}(z_a, z_b|y, X) = \frac{p(z_a|y, X)}{p(z_b|y, X)} &= \frac{p(z_a)}{p(z_b)} \times \frac{p(y|X, z_a)}{p(y|X, z_b)} \\
\text{posterior odds} &= \text{prior odds} \times \text{``Bayes factor''}
\end{aligned}
$$

The Bayes factor can be interpreted as how much the data favor model $z_a$
over model $z_b$. In order to obtain a posterior distribution over models, we
will have to compute $p(y|X, z)$ for each model $z$ under consideration. 


Computing the marginal probability 

The marginal probability is obtained from the integral 

$$
\begin{aligned}
p(y|X, z) &= \int\!\!\int p(y, \beta, \sigma^2|X, z)\, d\beta\, d\sigma^2 \\
 &= \int\!\!\int p(y|\beta, X, \sigma^2)\, p(\beta|X, z, \sigma^2)\, p(\sigma^2)\, d\beta\, d\sigma^2. \qquad (9.8)
\end{aligned}
$$

Using a version of the g-prior distribution for $\beta$, we will be able to
compute this integral without needing much calculus. For any given $z$ with
$p_z$ non-zero entries, let $X_z$ be the $n \times p_z$ matrix corresponding to
the variables $j$ for which $z_j = 1$, and similarly let $\beta_z$ be the
$p_z \times 1$ vector consisting of the entries of $\beta$ for which $z_j = 1$.
Our modified g-prior distribution for $\beta$ is that $\beta_j = 0$ for $j$'s
such that $z_j = 0$, and that 

$$ \{\beta_z|X_z, \sigma^2\} \sim \text{multivariate normal}(0, g\sigma^2[X_z^TX_z]^{-1}). $$


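If we integrate (9.8) with respect to $\beta$ first, we have 

$$
\begin{aligned}
p(y|X, z) &= \int \left( \int p(y|X, z, \sigma^2, \beta)\, p(\beta|X, z, \sigma^2)\, d\beta \right) p(\sigma^2)\, d\sigma^2 \\
 &= \int p(y|X, z, \sigma^2)\, p(\sigma^2)\, d\sigma^2.
\end{aligned}
$$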

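The form for the marginal probability $p(y|X, z, \sigma^2)$ was computed in the
last section. Using those results, writing $\gamma = 1/\sigma^2$ and letting
$p(\gamma)$ be the gamma density with parameters
$(\nu_0/2, \nu_0\sigma_0^2/2)$, we can show that the conditional density of
$(y, \gamma)$ given $(X, z)$ is 

$$
p(y|X, z, \gamma) \times p(\gamma) = (2\pi)^{-n/2}(1+g)^{-p_z/2} \times \left[\gamma^{n/2} e^{-\gamma\,\mathrm{SSR}_g^z/2}\right] \times (\nu_0\sigma_0^2/2)^{\nu_0/2}\,\Gamma(\nu_0/2)^{-1}\left[\gamma^{\nu_0/2-1} e^{-\gamma\nu_0\sigma_0^2/2}\right], \qquad (9.9)
$$

where $\mathrm{SSR}_g^z$ is as in the last section except based on the
regressor matrix $X_z$: 

$$ \mathrm{SSR}_g^z = y^T\left(I - \frac{g}{g+1} X_z(X_z^TX_z)^{-1}X_z^T\right)y. $$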


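The part of Equation 9.9 that depends on $\gamma$ is proportional to a gamma
density, but in this case the normalizing constant is the part that we need: 

$$
\gamma^{(\nu_0+n)/2-1} \exp[-\gamma \times (\nu_0\sigma_0^2 + \mathrm{SSR}_g^z)/2] =
\frac{\Gamma([\nu_0+n]/2)}{([\nu_0\sigma_0^2 + \mathrm{SSR}_g^z]/2)^{(\nu_0+n)/2}}
\times \text{dgamma}[\gamma, (\nu_0+n)/2, (\nu_0\sigma_0^2 + \mathrm{SSR}_g^z)/2].
$$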


Since the gamma density integrates to 1, the integral of the left-hand side of
the above equation must be equal to the constant on the right-hand side.
Multiplying this constant by the other terms in Equation 9.9 gives the marginal
probability we are interested in: 

$$
p(y|X, z) = \pi^{-n/2}\, \frac{\Gamma([\nu_0+n]/2)}{\Gamma(\nu_0/2)}\, (1+g)^{-p_z/2}\, \frac{(\nu_0\sigma_0^2)^{\nu_0/2}}{(\nu_0\sigma_0^2 + \mathrm{SSR}_g^z)^{(\nu_0+n)/2}}.
$$


Now suppose we set $g = n$ and use the unit information prior for
$p(\sigma^2)$ for each model $z$, so that $\nu_0 = 1$ for all $z$, but
$\sigma_0^2$ is the estimated residual variance under the least squares
estimate for model $z$, written $s_z^2$ below. In this case, the ratio of the
probabilities under any two models $z_a$ and $z_b$ is 

$$
\frac{p(y|X, z_a)}{p(y|X, z_b)} = (1+n)^{(p_{z_b} - p_{z_a})/2} \left(\frac{s_{z_a}^2}{s_{z_b}^2}\right)^{1/2} \times \left(\frac{s_{z_b}^2 + \mathrm{SSR}_g^{z_b}}{s_{z_a}^2 + \mathrm{SSR}_g^{z_a}}\right)^{(n+1)/2}.
$$

Notice that the ratio of the marginal probabilities is essentially a balance
between model complexity and goodness of fit: A large value of $p_{z_b}$
compared to $p_{z_a}$ penalizes model $z_b$, although a large value of
$\mathrm{SSR}_g^{z_a}$ compared to $\mathrm{SSR}_g^{z_b}$ penalizes model $z_a$. 


Oxygen uptake example 

Recall our regression model for the oxygen uptake data: 

$$
\begin{aligned}
E[Y_i|\beta, x_i] &= \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} \\
 &= \beta_1 + \beta_2 \times \text{group}_i + \beta_3 \times \text{age}_i + \beta_4 \times \text{group}_i \times \text{age}_i.
\end{aligned}
$$

The question of whether or not there is an effect of group translates into the
question of whether or not $\beta_2$ and $\beta_4$ are non-zero. Recall from
our analyses in the previous sections that the estimated magnitudes of
$\beta_2$ and $\beta_4$ were not large compared to their standard deviations,
suggesting that maybe there is not an effect of group. However, we also noticed
from the joint posterior distribution that $\beta_2$ and $\beta_4$ were
negatively correlated, so whether or not $\beta_4$ is zero affects our
information about $\beta_2$. 


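z          model                                                                                                          log p(y|X,z)   p(z|y,X)
(1,0,0,0)  $\beta_1$                                                                                                         -44.33        0.00
(1,1,0,0)  $\beta_1 + \beta_2 \times \text{group}_i$                                                                         -42.35        0.00
(1,0,1,0)  $\beta_1 + \beta_3 \times \text{age}_i$                                                                           -37.66        0.18
(1,1,1,0)  $\beta_1 + \beta_2 \times \text{group}_i + \beta_3 \times \text{age}_i$                                           -36.42        0.63
(1,1,1,1)  $\beta_1 + \beta_2 \times \text{group}_i + \beta_3 \times \text{age}_i + \beta_4 \times \text{group}_i \times \text{age}_i$   -37.60        0.19


Table 9.1. Marginal probabilities of the data under five different models. 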


We can formally evaluate whether $\beta_2$ or $\beta_4$ should be zero by
computing the probability of the data under a variety of competing models.
Table 9.1 lists five different regression models that we might like to consider
for these data. Using the g-prior for $\beta$ with $g = n$, and a unit
information prior distribution for $\sigma^2$ for each value of $z$, the values
of $\log p(y|X, z)$ can be computed for each of the five values of $z$ we are
considering. If we give each of these models equal prior weight, then posterior
probabilities for each model can be computed as well. These calculations
indicate that, among these five models, the most probable model is the one
corresponding to $z = (1,1,1,0)$, having a slope for age with a separate
intercept for each group. The evidence for an age effect is strong, as the
posterior probabilities of the three models that include age essentially sum
to 1. The evidence for an effect of group is weaker, as the combined
probability for the three models with a group effect is 0.00+0.63+0.19=0.82.
However, this probability is substantially higher than the corresponding prior
probability of 0.20+0.20+0.20=0.60 for these three models. 
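
The posterior probabilities in Table 9.1 follow directly from the listed log
marginal probabilities and the equal prior weights. The short calculation
below reproduces them; the object names (`lpy`, `prior`, `post`) and the
labels used for the models are choices made for this sketch.

```r
# log p(y|X,z) for the five models in Table 9.1, labelled by z
lpy <- c("1000" = -44.33, "1100" = -42.35, "1010" = -37.66,
         "1110" = -36.42, "1111" = -37.60)
prior <- rep(1 / 5, 5)                   # equal prior weight on each model

# Subtract the maximum before exponentiating to avoid numerical underflow
w <- prior * exp(lpy - max(lpy))
post <- w / sum(w)
round(post, 2)                           # 0.00 0.00 0.18 0.63 0.19

# Posterior probability of a group effect: the models with z_2 = 1
sum(post[c("1100", "1110", "1111")])     # approximately 0.82
```

Subtracting the largest log probability first changes nothing in the ratio
but keeps the exponentials in a safe numerical range, which matters when the
log marginal probabilities are large in magnitude.
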


9.3.2 Gibbs sampling and model averaging 


If we allow each of the $p$ regression coefficients to be either zero or
non-zero, then there are $2^p$ different models to consider. If $p$ is large,
then it will be impractical for us to compute the marginal probability of each
model. The diabetes data, for example, has $p = 64$ possible regressors, so the
total number of models is $2^{64} \approx 1.8 \times 10^{19}$. In these
situations our data analysis goals become more modest: For example, we may be
content with a decent estimate of $\beta$ from which we can make predictions,
or a list of relatively high-probability models. These items can be obtained
with a Markov chain which searches through the space of models for values of
$z$ with high posterior probability. This can be done with a Gibbs sampler in
which we iteratively sample each $z_j$ from its full conditional distribution.
Specifically, given a current value $z = (z_1, \ldots, z_p)$, a new value of
$z_j$ is generated by sampling from $p(z_j|y, X, z_{-j})$, where $z_{-j}$
refers to the values of $z$ except the one corresponding to regressor $j$. The
full conditional probability that $z_j$ is 1 can be written as
$o_j/(1 + o_j)$, where $o_j$ is the conditional odds that $z_j$ is 1, given by 

$$
o_j = \frac{\Pr(z_j = 1|y, X, z_{-j})}{\Pr(z_j = 0|y, X, z_{-j})} = \frac{\Pr(z_j = 1)}{\Pr(z_j = 0)} \times \frac{p(y|X, z_{-j}, z_j = 1)}{p(y|X, z_{-j}, z_j = 0)}.
$$
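
As a small numerical illustration of this update, the sketch below computes the
conditional odds on the log scale and draws a new value of $z_j$. It borrows
two log marginal probabilities from Table 9.1, taking $j = 2$ (the group
indicator) with $z_{-j}$ fixed so that the two candidate models are
$z = (1,1,1,0)$ and $z = (1,0,1,0)$; the prior probability of 1/2 and the
object names are assumptions made here.

```r
lpy1 <- -36.42    # log p(y | X, z_{-j}, z_j = 1), model (1,1,1,0) in Table 9.1
lpy0 <- -37.66    # log p(y | X, z_{-j}, z_j = 0), model (1,0,1,0) in Table 9.1
prior1 <- 0.5     # Pr(z_j = 1), assumed for this illustration

# Log conditional odds: log prior odds plus log ratio of marginal probabilities
log.o <- log(prior1) - log(1 - prior1) + lpy1 - lpy0

# Full conditional probability o_j / (1 + o_j), computed without forming o_j
pr1 <- 1 / (1 + exp(-log.o))

# One Gibbs update of z_j
set.seed(1)
zj <- rbinom(1, 1, pr1)
c(pr1 = pr1, zj = zj)
```
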


We may also want to obtain posterior samples of $\beta$ and $\sigma^2$. Using
the results of Section 9.2, values of these parameters can be sampled directly
from their conditional distributions given $z$, $y$ and $X$: For each $z$ in
our MCMC sample, we can construct the matrix $X_z$ which consists of only those
columns $j$ corresponding to non-zero values of $z_j$. Using this matrix of
regressors, a value of $\sigma^2$ can be sampled from $p(\sigma^2|X, y, z)$ (an
inverse-gamma distribution) and then a value of $\beta$ can be sampled from
$p(\beta|X, y, z, \sigma^2)$ (a multivariate normal distribution). Our Gibbs
sampling scheme therefore looks something like the following: 

$$
\begin{array}{ccccc}
z^{(s)} & \rightarrow & \sigma^{2(s)} & \rightarrow & \beta^{(s)} \\
\downarrow & & & & \\
z^{(s+1)} & \rightarrow & \sigma^{2(s+1)} & \rightarrow & \beta^{(s+1)}
\end{array}
$$

More precisely, generating values of
$\{z^{(s+1)}, \sigma^{2(s+1)}, \beta^{(s+1)}\}$ from $z^{(s)}$ is achieved with
the following steps: 

1. Set $z = z^{(s)}$; 
2. For $j \in \{1, \ldots, p\}$ in random order, replace $z_j$ with a sample from 
   $p(z_j|z_{-j}, y, X)$; 
3. Set $z^{(s+1)} = z$; 
4. Sample $\sigma^{2(s+1)} \sim p(\sigma^2|z^{(s+1)}, y, X)$; 
5. Sample $\beta^{(s+1)} \sim p(\beta|z^{(s+1)}, \sigma^{2(s+1)}, y, X)$. 
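
Steps 4 and 5 can be sketched as follows, assuming the g-prior results of
Section 9.2: given $z$, $\sigma^2$ is inverse-gamma with parameters
$([\nu_0+n]/2, [\nu_0\sigma_0^2 + \mathrm{SSR}_g^z]/2)$ and $\beta_z$ is
multivariate normal with mean $\frac{g}{g+1}\hat{\beta}_{ols}$ and variance
$\frac{g}{g+1}\sigma^2(X_z^TX_z)^{-1}$. The function name `sample.s2.beta`, its
arguments, and the simulated data in the usage lines are assumptions made for
this illustration.

```r
# Draw (sigma^2, beta) given z under the g-prior (sketch; the function name
# and arguments are chosen for this illustration)
sample.s2.beta <- function(y, X, z, g = length(y), nu0 = 1) {
  n <- length(y)
  beta <- rep(0, length(z))                  # beta_j = 0 wherever z_j = 0
  if (sum(z) == 0) {                         # empty model: no regressors
    s20 <- mean(y^2); SSRg <- sum(y^2)
    s2 <- 1 / rgamma(1, (nu0 + n) / 2, (nu0 * s20 + SSRg) / 2)
    return(list(s2 = s2, beta = beta))
  }
  Xz <- X[, z == 1, drop = FALSE]
  V <- solve(t(Xz) %*% Xz)
  b.ols <- V %*% t(Xz) %*% y
  s20 <- sum((y - Xz %*% b.ols)^2) / (n - ncol(Xz))   # unit information prior
  SSRg <- sum(y^2) - (g / (g + 1)) * sum(y * (Xz %*% b.ols))
  # Step 4: sigma^2 is inverse-gamma, so 1/sigma^2 is gamma
  s2 <- 1 / rgamma(1, (nu0 + n) / 2, (nu0 * s20 + SSRg) / 2)
  # Step 5: multivariate normal draw via a Cholesky factor of the covariance
  L <- t(chol((g / (g + 1)) * s2 * V))
  beta[z == 1] <- (g / (g + 1)) * b.ols + L %*% rnorm(ncol(Xz))
  list(s2 = s2, beta = beta)
}

# Usage on simulated data with z = (1, 0, 1)
set.seed(1)
X <- matrix(rnorm(300), 100, 3)
y <- X[, 1] + rnorm(100)
sample.s2.beta(y, X, z = c(1, 0, 1))
```
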


Note that the entries of $z^{(s+1)}$ are not sampled from their full
conditional distributions given $\sigma^{2(s)}$ and $\beta^{(s)}$. This is not
a problem: The Gibbs sampler for $z$ ensures that the distribution of
$z^{(s)}$ converges to the target posterior distribution $p(z|y, X)$. Since
$(\sigma^{2(s)}, \beta^{(s)})$ are direct samples from
$p(\sigma^2, \beta|z^{(s)}, y, X)$, the distribution of
$(\sigma^{2(s)}, \beta^{(s)})$ converges to $p(\sigma^2, \beta|y, X)$. R-code
to implement the Gibbs sampling algorithm in $z$, along with a function lpy.X
that calculates the log of $p(y|X)$, is below. This code can be combined with
the code in the previous section in order to generate samples of
$\{z, \sigma^2, \beta\}$ from the joint posterior distribution. 

##### a function to compute the log marginal probability log p(y|X)
lpy.X<-function(y,X,g=length(y),
   nu0=1,s20=try(summary(lm(y~-1+X))$sigma^2,silent=TRUE))
{
  n<-dim(X)[1] ; p<-dim(X)[2]
  # empty model: no projection, and s20 is set before its default is evaluated
  if(p==0) { Hg<-0 ; s20<-mean(y^2) }
  if(p>0) { Hg<-(g/(g+1)) * X%*%solve(t(X)%*%X)%*%t(X) }
  SSRg<- t(y)%*%( diag(1,nrow=n) - Hg )%*%y

  -.5*( n*log(pi)+p*log(1+g)+(nu0+n)*log(nu0*s20+SSRg)-
        nu0*log(nu0*s20) ) +
    lgamma( (nu0+n)/2 ) - lgamma(nu0/2)
}
#####

##### starting values and MCMC setup
z<-rep(1,dim(X)[2] )                      # start with all regressors included
lpy.c<-lpy.X(y,X[,z==1,drop=FALSE])       # log p(y|X,z) at the current z
S<-10000
Z<-matrix(NA,S,dim(X)[2])
#####

##### Gibbs sampler
for(s in 1:S)
{
  for(j in sample(1:dim(X)[2]))           # update the z_j's in random order
  {
    zp<-z ; zp[j]<-1-zp[j]                # proposed z: flip entry j
    lpy.p<-lpy.X(y,X[,zp==1,drop=FALSE])
    r<- (lpy.p - lpy.c)*(-1)^(zp[j]==0)   # log odds that z_j = 1
    z[j]<-rbinom(1,1,1/(1+exp(-r)))
    if(z[j]==zp[j]) {lpy.c<-lpy.p}
  }
  Z[s,]<-z
}
#####


Diabetes example 


Using a uniform distribution on $z$, the Gibbs sampling scheme described above
was run for $S = 10{,}000$ iterations, generating 10,000 samples of
$(z, \sigma^2, \beta)$ which we can use to approximate the posterior
distribution $p(z, \sigma^2, \beta|y, X)$. How good will our approximation be?
Recall that with $p = 64$ the total number of models, or possible values of
$z$, is $2^{64} \approx 10^{19}$, which is $10^{15}$ times as large as the
number of approximate posterior samples we have. It should then not be too much
of a surprise that, in the 10,000 MCMC samples of $z$, only 32 of the possible
models were sampled more than once: 28 models were sampled twice, two were
sampled three times and two others were sampled five and six times. This means
that, for large $p$, the Gibbs sampling scheme provides a poor approximation to
the posterior distribution of $z$. Nevertheless, in many situations where most
of the regressors have no effect the Gibbs sampler can still provide reasonable
estimates of the marginal posterior distributions of individual $z_j$'s or
$\beta_j$'s. The first panel of Figure 9.7 shows the estimated posterior
probabilities $\Pr(z_j = 1|y, X)$ for each of the 64 regressors. There are six
regression coefficients having a posterior probability higher than 0.5 of being
non-zero. These six regressors are a subset of the 20 that remained after the
backwards selection procedure described above. How well does this Bayesian
approach do in terms of prediction? As usual, we can approximate the posterior
mean of $\beta$ with $\hat{\beta}_{bma} = \sum_{s=1}^S \beta^{(s)}/S$. This
parameter estimate is sometimes called the (Bayesian) model averaged estimate
of $\beta$, because it is an average of regression parameters from different
values of $z$, i.e. over different regression models. This estimate, obtained
by averaging the regression coefficients from several high-probability models,
often performs better than the estimate of $\beta$ obtained by considering only
a single model. Returning to the problem of predicting data from the diabetes
test set, we can compute model-averaged predicted values
$\hat{y}_{test} = X_{test}\hat{\beta}_{bma}$. These predicted values are
plotted against the true values $y_{test}$ in the second panel of Figure 9.7.
These predictions have an average squared error of 0.452, which is better than
the OLS estimates using either the full model or the one obtained from
backwards elimination. 
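
The summaries used in this example are simple functions of the sampler output.
The sketch below shows how they might be computed from an $S \times p$ matrix
of sampled $z$-values and a matching matrix of sampled $\beta$-values. Because
the diabetes output is not reproduced here, `Z` and `BETA` are filled with
simulated stand-in values; these, and the object names, are assumptions made
for illustration only.

```r
# Stand-ins for sampler output (assumption: simulated here, not MCMC draws)
set.seed(1)
S <- 1000; p <- 4
Z <- matrix(rbinom(S * p, 1, rep(c(0.9, 0.1, 0.6, 0.05), each = S)), S, p)
BETA <- Z * matrix(rnorm(S * p, 1, 0.2), S, p)   # beta_j = 0 whenever z_j = 0

colMeans(Z)                     # estimates of Pr(z_j = 1 | y, X)
beta.bma <- colMeans(BETA)      # model averaged estimate of beta

# How often each distinct model z was visited
visits <- table(apply(Z, 1, paste, collapse = ""))
sort(visits, decreasing = TRUE)[1:3]   # the three most frequently visited models
sum(visits > 1)                        # number of models sampled more than once
```

Model-averaged predictions for a test set then follow from a single matrix
product of the test regressors with `beta.bma`.
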

Finally, we evaluate the Bayesian model selection procedure when there is
no relationship between $Y$ and $x$. Recall from above that when the backwards
elimination procedure was applied to the permuted vector $\tilde{y}$, which was
constructed independently of $X$, it erroneously returned 18 regressors.
Running the Gibbs sampler above on the same dataset $(\tilde{y}, X)$ for 10,000
iterations provides approximated posterior probabilities
$\Pr(z_j = 1|\tilde{y}, X) \approx \sum_s z_j^{(s)}/S$, all of which are less
than 1/2, and all but two of which are less than 1/4. In contrast to the
backwards selection procedure, for these data the Bayesian approach to model
selection does not erroneously identify any regressors as having an effect on
the distribution of $\tilde{y}$. 

Fig. 9.7. The first panel shows posterior probabilities that each coefficient
is non-zero. The second panel shows $y_{test}$ versus predictions based on the
model averaged estimate of $\beta$. 


9.4 Discussion and further references 


There are many approaches to Bayesian model selection that use prior
distributions allowing elements of $\beta$ to be identically zero. George and
McCulloch (1993) parameterize $\beta_j$ as $b_j \times z_j$, where
$z_j \in \{0,1\}$, and use Gibbs sampling to do model selection. Liang et al
(2008) review various types of g-priors in terms of two types of asymptotic
consistency: model consistency and predictive consistency. The former is
concerned with selecting the "true model," and the latter with making accurate
posterior predictions. As pointed out by Leamer (1978), selecting a model and
then acting as if it were true understates the uncertainty in the model
selection process, and can result in suboptimal predictive performance.
Predictive performance can be improved by Bayesian model averaging, i.e.
averaging the predictive distributions under the different models according to
their posterior probability (Madigan and Raftery, 1994; Raftery et al, 1997). 

Many have argued that in most situations none of the regression models
under consideration are actually true. Results of Bernardo and Smith (1994,
Section 6.1.6) and Key et al (1999) indicate that in this situation, Bayesian
model selection can still be meaningful in a decision-theoretic sense, where
the task is to select the model with the best predictive performance. In this
case, model selection proceeds using a modified Bayes factor that is similar to
a cross-validation criterion. 


10 


Nonconjugate priors and Metropolis-Hastings 
algorithms 


When conjugate or semiconjugate prior distributions are used, the posterior 
distribution can be approximated with the Monte Carlo method or the Gibbs 
sampler. In situations where a conjugate prior distribution is unavailable or 
undesirable, the full conditional distributions of the parameters do not have 
a standard form and the Gibbs sampler cannot be easily used. In this section 
we present the Metropolis-Hastings algorithm as a generic method of approx- 
imating the posterior distribution corresponding to any combination of prior 
distribution and sampling model. This section presents the algorithm in the 
context of two examples: The first involves Poisson regression, which is a type 
of generalized linear model. The second is a longitudinal regression model in 
which the observations are correlated over time. 


10.1 Generalized linear models 


Example: Song sparrow reproductive success 


A sample from a population of 52 female song sparrows was studied over 
the course of a summer and their reproductive activities were recorded. In 
particular, the age and number of new offspring were recorded for each sparrow 
(Arcese et al, 1992). Figure 10.1 shows boxplots of the number of offspring 
versus age. The figure indicates that two-year-old birds in this population 
had the highest median reproductive success, with the number of offspring 
declining beyond two years of age. This is not surprising from a biological 
point of view: One-year-old birds are in their first mating season and are 
relatively inexperienced compared to two-year-old birds. As birds age beyond 
two years they experience a general decline in health and activity. 

Fig. 10.1. Number of offspring versus age. 


P.D. Hoff, A First Course in Bayesian Statistical Methods, 
Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_10, 
© Springer Science+Business Media, LLC 2009 

Suppose we wish to fit a probability model to these data, perhaps to
understand the relationship between age and reproductive success, or to make
population forecasts for this group of birds. Since the number of offspring for
each bird is a non-negative integer $\{0, 1, 2, \ldots\}$, a simple probability
model for $Y$ = number of offspring conditional on $x$ = age would be a Poisson
model, $\{Y|x\} \sim \text{Poisson}(\theta_x)$. One possibility would be to
estimate $\theta_x$ separately for each age group. However, the number of birds
of each age is small and so the estimates of $\theta_x$ would be imprecise. To
add stability to the estimation we will assume that the mean number of
offspring is a smooth function of age. We will want to allow this function to
be quadratic so that we can represent the increase in mean offspring while
birds mature and the decline they experience thereafter. One possibility would
be to express $\theta_x$ as $\theta_x = \beta_1 + \beta_2 x + \beta_3 x^2$.
However, such a parameterization might allow some values of $\theta_x$ to be
negative, which is not physically possible. As an alternative, we will model
the log-mean of $Y$ in terms of this regression, so that 

$$ \log E[Y|x] = \log\theta_x = \beta_1 + \beta_2 x + \beta_3 x^2, $$

which means that $E[Y|x] = \exp(\beta_1 + \beta_2 x + \beta_3 x^2)$, which is
always greater than zero. 

The resulting model, $\{Y|\boldsymbol{x}\} \sim
\text{Poisson}(\exp[\beta^T\boldsymbol{x}])$ with
$\boldsymbol{x} = (1, x, x^2)^T$, is called a Poisson regression model. The
term $\beta^T\boldsymbol{x}$ is called the linear predictor. In this regression
model the linear predictor is linked to $E[Y|\boldsymbol{x}]$ via the log
function, and so we say that this model has a log link. The Poisson regression
model is a type of generalized linear model, a model which relates a function
of the expectation to a linear predictor of the form $\beta^T\boldsymbol{x}$.
Another common generalized linear model is the logistic regression model for
binary data. Writing
$\Pr(Y = 1|\boldsymbol{x}) = E[Y|\boldsymbol{x}] = \theta_x$, the logistic
regression model parameterizes $\theta_x$ as 

$$
\theta_x = \frac{\exp(\beta^T\boldsymbol{x})}{1 + \exp(\beta^T\boldsymbol{x})}, \quad \text{so that} \quad \beta^T\boldsymbol{x} = \log\frac{\theta_x}{1 - \theta_x}.
$$

The function $\log \theta_x/(1 - \theta_x)$ relating the mean to the linear
predictor is called the logit function, so the logistic regression model could
be described as a binary regression model with a logit link. Notice that the
logit link forces $\theta_x$ to be between zero and one, even though
$\beta^T\boldsymbol{x}$ can range over the whole real line. 
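
The two link functions are easy to check numerically. In the sketch below the
coefficient values are hypothetical and chosen only for illustration; `plogis`
and `qlogis` are R's built-in inverse-logit and logit functions.

```r
x <- 1:6                                       # ages
beta <- c(0.3, 0.7, -0.1)                      # hypothetical coefficients
eta <- beta[1] + beta[2] * x + beta[3] * x^2   # linear predictor

theta.pois <- exp(eta)         # log link: the mean is always positive
theta.logit <- plogis(eta)     # logit link: the mean is always in (0, 1)

all.equal(qlogis(theta.logit), eta)   # the logit recovers the linear predictor
```
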

As in the case of ordinary regression, a natural class of prior distributions
for $\beta$ is the class of multivariate normal distributions. However, for
neither the Poisson nor the logistic regression model would a prior
distribution from this class result in a multivariate normal posterior
distribution for $\beta$. Furthermore, standard conjugate prior distributions
for generalized linear models do not exist (except for the normal regression
model). 

One possible way to calculate the posterior distribution is to use a
grid-based approximation, similar to the approach we used in Section 6.2: We
can evaluate $p(y|X, \beta) \times p(\beta)$ on a three-dimensional grid of
$\beta$-values, then normalize the result to obtain a discrete approximation to
$p(\beta|X, y)$. Figure 10.2 shows approximate marginal and joint distributions
of $\beta_2$ and $\beta_3$ based on the prior distribution
$\beta \sim \text{multivariate normal}(0, 100 \times I)$ and a grid having 100
values for each parameter. Computing these quantities for this three-parameter
model required the calculation of $p(y|X, \beta) \times p(\beta)$ at 1 million
grid points. While feasible for this problem, a Poisson regression with only
two more regressors and the same grid density would require 10 billion grid
points, which is prohibitively large. Additionally, grid-based approximations
can be very inefficient: The third panel of Figure 10.2 shows a strong negative
posterior correlation between $\beta_2$ and $\beta_3$, which means that the
probability mass is concentrated along a diagonal and so the vast majority of
points of our cubical grid have essentially zero probability. In contrast, an
approximation of $p(\beta|X, y)$ based on Monte Carlo samples could be stored
in a computer much more efficiently, since our Monte Carlo sample would not
include any points that have essentially zero posterior probability. Although
independent Monte Carlo sampling from the posterior is not available for this
Poisson regression model, the next section will show how to construct a Markov
chain that can approximate $p(\beta|X, y)$ for any prior distribution
$p(\beta)$. 
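
A grid-based approximation of this kind might be coded as below. This is a
sketch under several assumptions: the sparrow data are not reproduced here, so
the data are simulated; the grid ranges were picked by hand; and the grid has
30 rather than 100 values per parameter to keep the computation light. All
object names are choices made for this illustration.

```r
# Simulated stand-in for the sparrow data (assumption: not the real dataset)
set.seed(1)
x <- sample(1:6, 52, replace = TRUE)
X <- cbind(1, x, x^2)
y <- rpois(52, exp(X %*% c(0.3, 0.7, -0.13)))

# Coarse three-dimensional grid of beta-values (30 per parameter)
G <- 30
grid <- as.matrix(expand.grid(b1 = seq(-2, 3, length.out = G),
                              b2 = seq(-1, 2.5, length.out = G),
                              b3 = seq(-0.5, 0.2, length.out = G)))

# log p(y|X,beta) + log p(beta) at every grid point, up to an additive
# constant, with prior beta ~ multivariate normal(0, 100 x I)
eta <- grid %*% t(X)                     # G^3 x n matrix of linear predictors
lpost <- as.vector(eta %*% y) - rowSums(exp(eta)) +
  rowSums(dnorm(grid, 0, 10, log = TRUE))

# Normalize to obtain a discrete approximation to p(beta|X,y)
post <- exp(lpost - max(lpost))
post <- post / sum(post)

# Marginal posterior of beta_3, and the share of grid points needed to
# hold 99% of the posterior mass
marg.b3 <- tapply(post, grid[, "b3"], sum)
mean(cumsum(sort(post, decreasing = TRUE)) < 0.99)
```

The last line gives a rough sense of the inefficiency discussed above: it
reports the fraction of grid points that carry almost all of the posterior
probability.
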


10.2 The Metropolis algorithm 


Let's consider a very generic situation where we have a sampling model
$Y \sim p(y|\theta)$ and a prior distribution $p(\theta)$. Although in most
problems $p(y|\theta)$ and $p(\theta)$ can be calculated for any values of $y$
and $\theta$,
$p(\theta|y) = p(\theta)p(y|\theta) / \int p(\theta')p(y|\theta')\, d\theta'$
is often hard to calculate due to the integral in the denominator. If we were
able to sample from $p(\theta|y)$, then we could generate
$\theta^{(1)}, \ldots, \theta^{(S)} \sim$ i.i.d. $p(\theta|y)$ and obtain Monte
Carlo approximations to posterior quantities, such as 

$$ E[g(\theta)|y] \approx \frac{1}{S}\sum_{s=1}^{S} g(\theta^{(s)}). $$

Fig. 10.2. Grid-based approximations to $p(\beta_2|X, y)$, $p(\beta_3|X, y)$
and $p(\beta_2, \beta_3|X, y)$. 

But what if we cannot sample directly from $p(\theta|y)$? In terms of
approximating the posterior distribution, the critical thing is not that we
have i.i.d. samples from $p(\theta|y)$ but rather that we are able to construct
a large collection of $\theta$-values, $\{\theta^{(1)}, \ldots,
\theta^{(S)}\}$, whose empirical distribution approximates $p(\theta|y)$.
Roughly speaking, for any two different values $\theta_a$ and $\theta_b$ we
need 

$$
\frac{\#\{\theta^{(s)}\text{'s in the collection} = \theta_a\}}{\#\{\theta^{(s)}\text{'s in the collection} = \theta_b\}} \approx \frac{p(\theta_a|y)}{p(\theta_b|y)}.
$$

Let's think intuitively about how we might construct such a collection.
Suppose we have a working collection $\{\theta^{(1)}, \ldots, \theta^{(s)}\}$
to which we would like to add a new value $\theta^{(s+1)}$. Let's consider
adding a value $\theta^*$ which is nearby $\theta^{(s)}$. Should we include
$\theta^*$ in the set or not? If $p(\theta^*|y) > p(\theta^{(s)}|y)$ then we
want more $\theta^*$'s in the set than $\theta^{(s)}$'s. Since $\theta^{(s)}$
is already in the set, then it seems we should include $\theta^*$ as well. On
the other hand, if $p(\theta^*|y) < p(\theta^{(s)}|y)$ then it seems we should
not necessarily include $\theta^*$. So perhaps our decision to include
$\theta^*$ or not should be based on a comparison of $p(\theta^*|y)$ to
$p(\theta^{(s)}|y)$. Fortunately, this comparison can be made even if we cannot
compute $p(\theta|y)$: 

$$
r = \frac{p(\theta^*|y)}{p(\theta^{(s)}|y)} = \frac{p(y|\theta^*)p(\theta^*)}{p(y)} \cdot \frac{p(y)}{p(y|\theta^{(s)})p(\theta^{(s)})} = \frac{p(y|\theta^*)p(\theta^*)}{p(y|\theta^{(s)})p(\theta^{(s)})}. \qquad (10.1)
$$
Having computed $r$, how should we proceed? 

If $r > 1$: 

   Intuition: Since $\theta^{(s)}$ is already in our set, we should include
   $\theta^*$ as it has a higher probability than $\theta^{(s)}$. 

   Procedure: Accept $\theta^*$ into our set, i.e. set
   $\theta^{(s+1)} = \theta^*$. 

If $r < 1$: 

   Intuition: The relative frequency of $\theta$-values in our set equal to
   $\theta^*$ compared to those equal to $\theta^{(s)}$ should be
   $p(\theta^*|y)/p(\theta^{(s)}|y) = r$. This means that for every instance of
   $\theta^{(s)}$, we should have only a "fraction" of an instance of a
   $\theta^*$ value. 

   Procedure: Set $\theta^{(s+1)}$ equal to either $\theta^*$ or
   $\theta^{(s)}$, with probability $r$ and $1 - r$ respectively. 


This is the basic intuition behind the famous Metropolis algorithm. The
Metropolis algorithm proceeds by sampling a proposal value $\theta^*$ nearby
the current value $\theta^{(s)}$ using a symmetric proposal distribution
$J(\theta^*|\theta^{(s)})$. Symmetric here means that
$J(\theta_b|\theta_a) = J(\theta_a|\theta_b)$, i.e. the probability of
proposing $\theta^* = \theta_b$ given that $\theta^{(s)} = \theta_a$ is equal
to the probability of proposing $\theta^* = \theta_a$ given that
$\theta^{(s)} = \theta_b$. Usually $J(\theta^*|\theta^{(s)})$ is very simple,
with samples from $J(\theta^*|\theta^{(s)})$ being near $\theta^{(s)}$ with
high probability. Examples include 

• $J(\theta^*|\theta^{(s)}) = \text{uniform}(\theta^{(s)} - \delta, \theta^{(s)} + \delta)$; 

• $J(\theta^*|\theta^{(s)}) = \text{normal}(\theta^{(s)}, \delta^2)$. 

The value of the parameter $\delta$ is generally chosen to make the
approximation algorithm run efficiently, as will be discussed in more detail
shortly. 

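Having obtained a proposal value $\theta^*$, we add either it or a copy of
$\theta^{(s)}$ to our set, depending on the ratio
$r = p(\theta^*|y)/p(\theta^{(s)}|y)$. Specifically, given $\theta^{(s)}$, the
Metropolis algorithm generates a value $\theta^{(s+1)}$ as follows: 

1. Sample $\theta^* \sim J(\theta|\theta^{(s)})$; 
2. Compute the acceptance ratio 

$$ r = \frac{p(\theta^*|y)}{p(\theta^{(s)}|y)} = \frac{p(y|\theta^*)p(\theta^*)}{p(y|\theta^{(s)})p(\theta^{(s)})}. $$

3. Let 

$$
\theta^{(s+1)} = \begin{cases} \theta^* & \text{with probability } \min(r, 1) \\ \theta^{(s)} & \text{with probability } 1 - \min(r, 1). \end{cases}
$$

Step 3 can be accomplished by sampling $u \sim \text{uniform}(0,1)$ and setting
$\theta^{(s+1)} = \theta^*$ if $u < r$ and setting
$\theta^{(s+1)} = \theta^{(s)}$ otherwise. 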
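
The three steps above can be sketched as a generic R function for a scalar
parameter. The function name `metropolis`, its arguments, and the standard
normal target in the usage lines are assumptions introduced for this
illustration. The comparison in step 3 is made on the log scale, which leaves
the algorithm unchanged but is numerically safer.

```r
# Metropolis sampler with a normal proposal (sketch; names chosen here).
# lpost returns log p(y|theta) + log p(theta), up to an additive constant.
metropolis <- function(lpost, theta0, delta, S) {
  THETA <- numeric(S)
  theta <- theta0
  for (s in 1:S) {
    theta.star <- rnorm(1, theta, delta)            # step 1: symmetric proposal
    log.r <- lpost(theta.star) - lpost(theta)       # step 2: log acceptance ratio
    if (log(runif(1)) < log.r) theta <- theta.star  # step 3: accept w.p. min(r,1)
    THETA[s] <- theta
  }
  THETA
}

# Usage: a target proportional to the standard normal density
set.seed(1)
draws <- metropolis(function(th) -th^2 / 2, theta0 = 0, delta = 2, S = 5000)
c(mean(draws), var(draws))
```
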


Example: Normal distribution with known variance 


Let's try out the Metropolis algorithm for the conjugate normal model with
a known variance, a situation where we know the correct posterior distribution.
Letting $\theta \sim \text{normal}(\mu, \tau^2)$ and
$\{y_1, \ldots, y_n|\theta\} \sim$ i.i.d. $\text{normal}(\theta, \sigma^2)$,
the posterior distribution of $\theta$ is $\text{normal}(\mu_n, \tau_n^2)$
where 

$$
\begin{aligned}
\mu_n &= \bar{y}\, \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2} + \mu\, \frac{1/\tau^2}{n/\sigma^2 + 1/\tau^2} \\
\tau_n^2 &= 1/(n/\sigma^2 + 1/\tau^2).
\end{aligned}
$$

Suppose $\sigma^2 = 1$, $\tau^2 = 10$, $\mu = 5$, $n = 5$ and
$y = (9.37, 10.18, 9.16, 11.60, 10.33)$. For these data, $\mu_n = 10.03$ and
$\tau_n^2 = .20$, and so $p(\theta|y) = \text{dnorm}(\theta, 10.03, .44)$. 
Now suppose that for some reason we were unable to obtain the formula for 
this posterior distribution and needed to use the Metropolis algorithm to ap- 
proximate it. Based on this model and prior distribution, the acceptance ratio 
comparing a proposed value 6* to a current value 6“) is 


= p(O*|y) 7 ive dnorm(y;, 0*, c) ; dnorm(0*, u, T) 
~ p(O)|\y)  \IE dnorm(y;, 0), o) dnorm(6(9), p, T) J” 
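As a quick check on the closed-form posterior quoted above, the posterior mean and variance can be computed directly from the formulas for $\mu_n$ and $\tau_n^2$. The variable names below (prec, mu_n, t2_n) are chosen for this sketch only.

```r
# Closed-form posterior for the conjugate normal model with known variance.
s2 <- 1; t2 <- 10; mu <- 5
y <- c(9.37, 10.18, 9.16, 11.60, 10.33)
n <- length(y)

prec <- n / s2 + 1 / t2                               # posterior precision
mu_n <- (n / s2) / prec * mean(y) + (1 / t2) / prec * mu
t2_n <- 1 / prec

round(c(mu_n = mu_n, t2_n = t2_n, sd_n = sqrt(t2_n)), 2)
# approximately 10.03, 0.20 and 0.44
```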


In many cases, computing the ratio $r$ directly can be numerically unstable, a
problem that often can be remedied by computing the logarithm of $r$:

$$ \log r = \sum_{i=1}^n \left[ \log \text{dnorm}(y_i, \theta^*, \sigma) - \log \text{dnorm}(y_i, \theta^{(s)}, \sigma) \right] + \log \text{dnorm}(\theta^*, \mu, \tau) - \log \text{dnorm}(\theta^{(s)}, \mu, \tau). $$

Keeping things on the log scale, the proposal is accepted if $\log u < \log r$, where
$u$ is a sample from the uniform distribution on (0, 1).
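The instability is easy to reproduce. The sketch below uses a hypothetical data set of 2,000 simulated observations (not the five values from the example) and, for brevity, leaves out the prior terms: the product of densities underflows to zero, so the direct ratio is 0/0, while the log-scale version stays finite.

```r
set.seed(1)
y_big <- rnorm(2000, 10, 1)   # hypothetical larger data set, for illustration only

lik    <- function(theta) prod(dnorm(y_big, theta, 1))             # direct product
loglik <- function(theta) sum(dnorm(y_big, theta, 1, log = TRUE))  # log scale

r_direct <- lik(10.1) / lik(10.0)        # both products underflow to 0, giving NaN
log_r    <- loglik(10.1) - loglik(10.0)  # finite difference of log-likelihoods

r_direct
log_r
```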

The R-code below generates 10,000 iterations of the Metropolis algorithm,
starting at $\theta^{(0)} = 0$ and using a normal proposal distribution,
$\theta^* \sim$ normal($\theta^{(s)}, \delta^2$) with $\delta^2 = 2$.

s2<-1 ; t2<-10 ; mu<-5
y<-c(9.37,10.18,9.16,11.60,10.33)
theta<-0 ; delta2<-2 ; S<-10000 ; THETA<-NULL ; set.seed(1)

for(s in 1:S)
{
  # propose a value from normal(theta, delta2)
  theta.star<-rnorm(1,theta,sqrt(delta2))

  # log acceptance ratio: (log likelihood + log prior) at the proposal
  # minus the same quantity at the current value
  log.r<-( sum(dnorm(y,theta.star,sqrt(s2),log=TRUE)) +
               dnorm(theta.star,mu,sqrt(t2),log=TRUE) ) -
         ( sum(dnorm(y,theta,sqrt(s2),log=TRUE)) +
               dnorm(theta,mu,sqrt(t2),log=TRUE) )

  # accept the proposal with probability min(1,r)
  if(log(runif(1))<log.r) { theta<-theta.star }

  THETA<-c(THETA,theta)
}
The first panel of Figure 10.3 plots these 10,000 simulated values as a
function of iteration number. Although the value of $\theta$ starts nowhere near
the posterior mean of 10.03, it quickly arrives there after a few iterations. The
second panel gives a histogram of the 10,000 $\theta$-values, and includes a plot of the
normal(10.03, 0.20) density for comparison. Clearly the empirical distribution
of the simulated values is very close to the true posterior distribution. Will
this similarity between $\{\theta^{(1)},\ldots,\theta^{(S)}\}$ and $p(\theta|y)$ hold in general?

Fig. 10.3. Results from the Metropolis algorithm for the normal model.


Output of the Metropolis algorithm 


The Metropolis algorithm generates a dependent sequence $\{\theta^{(1)}, \theta^{(2)}, \ldots\}$ of
$\theta$-values. Since our procedure for generating $\theta^{(s+1)}$ depends only on $\theta^{(s)}$, the
conditional distribution of $\theta^{(s+1)}$ given $\{\theta^{(1)},\ldots,\theta^{(s)}\}$ also depends only on
$\theta^{(s)}$ and so the sequence $\{\theta^{(1)}, \theta^{(2)}, \ldots\}$ is a Markov chain.

Under some mild conditions the marginal sampling distribution of $\theta^{(s)}$ is
approximately $p(\theta|y)$ for large $s$. Additionally, for any given numerical value
$\theta_a$ of $\theta$,

$$ \lim_{S \to \infty} \frac{\#\{\theta\text{'s in the sequence} < \theta_a\}}{S} = p(\theta < \theta_a|y). $$


Just as with the Gibbs sampler, this suggests we can approximate posterior
means, quantiles and other posterior quantities of interest using the empirical
distribution of $\{\theta^{(1)},\ldots,\theta^{(S)}\}$. However, our approximation to these quantities
will depend on how well our simulated sequence actually approximates
$p(\theta|y)$. Results from probability theory say that, in the limit as $S \to \infty$, the
approximation will be exact, but in practice we cannot run the Markov chain
forever. Instead, the standard practice in MCMC approximation, using either
the Metropolis algorithm or the Gibbs sampler, is as follows:

1. run the algorithm until some iteration $B$ for which it looks like the Markov
chain has achieved stationarity;

2. run the algorithm $S$ more times, generating $\{\theta^{(B+1)},\ldots,\theta^{(B+S)}\}$;

3. discard $\{\theta^{(1)},\ldots,\theta^{(B)}\}$ and use the empirical distribution of
$\{\theta^{(B+1)},\ldots,\theta^{(B+S)}\}$ to approximate $p(\theta|y)$.

The iterations up to and including $B$ are called the “burn-in” period, in which
the Markov chain moves from its initial value to a region of the parameter
space that has high posterior probability. If we have a good idea of where this
high probability region is, we can reduce the burn-in period by starting the
Markov chain there. For example, in the Metropolis algorithm above it would
have been better to start with $\theta^{(1)} = \bar{y}$ as we know that the posterior mode
will be near $\bar{y}$. However, starting with $\theta^{(1)} = 0$ illustrates that the Metropolis
algorithm is able to move from a low posterior probability region to one of
high probability.
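The burn-in procedure can be sketched for the normal example as follows. The burn-in length B = 500 is an assumed value picked for illustration; in practice it would be chosen by inspecting a plot of the chain.

```r
s2 <- 1; t2 <- 10; mu <- 5
y <- c(9.37, 10.18, 9.16, 11.60, 10.33)
log_post <- function(theta) {
  sum(dnorm(y, theta, sqrt(s2), log = TRUE)) + dnorm(theta, mu, sqrt(t2), log = TRUE)
}

set.seed(1)
B <- 500; S <- 10000          # B is an assumed burn-in length
theta <- 0
THETA <- numeric(B + S)
for (s in 1:(B + S)) {
  theta_star <- rnorm(1, theta, sqrt(2))
  if (log(runif(1)) < log_post(theta_star) - log_post(theta)) theta <- theta_star
  THETA[s] <- theta
}

kept <- THETA[(B + 1):(B + S)]            # step 3: discard the first B values
c(mean = mean(kept), var = var(kept))     # compare with 10.03 and 0.20
quantile(kept, c(0.025, 0.5, 0.975))
```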

The $\theta$-values generated from an MCMC algorithm are statistically dependent.
Recall from the discussion of MCMC diagnostics in Chapter 6 that the
higher the correlation, the longer it will take for the Markov chain to achieve
stationarity and the more iterations it will take to get a good approximation to
$p(\theta|y)$. Roughly speaking, the amount of information we obtain about $E[\theta|y]$
from $S$ positively correlated samples is less than the information we would
obtain from $S$ independent samples. The more correlated our Markov chain is,
the less information we get per iteration (recall the notion of “effective sample
size” from Section 6.6). In Gibbs sampling we do not have much control over
the correlation of the Markov chain, but with the Metropolis algorithm the
correlation can be adjusted by selecting an optimal value of $\delta$ in the proposal
distribution. By selecting $\delta$ carefully, we can decrease the correlation in the
Markov chain, leading to an increase in the rate of convergence, an increase
in the effective sample size of the Markov chain and an improvement in the
Monte Carlo approximation to the posterior distribution.


Fig. 10.4. Markov chains under three different proposal distributions. Going from
left to right, the values of $\delta^2$ are 1/32, 2 and 64 respectively. (Each panel plots
the chain against iteration number, for iterations 0 to 500.)


To illustrate this, we can rerun the Metropolis algorithm for the one-sample
normal problem using a range of $\delta$ values, including $\delta^2 \in \{1/32, 1/2, 2, 32, 64\}$.
Doing so results in lag-1 autocorrelations of (0.98, 0.77, 0.69, 0.84, 0.86)
for these five different $\delta$-values. Interestingly, the best $\delta$-value among these
five occurs in the middle of the set of values, and not at the extremes. The
reason why can be understood by inspecting each sequence. Figure 10.4 plots
the first 500 values for the sequences corresponding to $\delta^2 \in \{1/32, 2, 64\}$. In
the first panel where $\delta^2 = 1/32$, the small proposal variance means that $\theta^*$
will be very close to $\theta^{(s)}$, and so $r \approx 1$ for most proposed values. As a result,
$\theta^*$ is accepted as the value of $\theta^{(s+1)}$ for 87% of the iterations. Although this
high acceptance rate keeps the chain moving, the moves are never very large
and so the Markov chain is highly correlated. One consequence of this is that
it takes a large number of iterations for the Markov chain to move from the
starting value of zero to the posterior mode of 10.03. At the other extreme,
the third plot in the figure shows the Markov chain for $\delta^2 = 64$. In this case
the chain moves quickly to the posterior mode but once there it gets “stuck”
for long periods. This is because the variance of the proposal distribution is so
large that $\theta^*$ is frequently very far away from the posterior mode. Proposals
in this Metropolis algorithm are accepted for only 5% of the iterations, and
so $\theta^{(s+1)}$ is set equal to $\theta^{(s)}$ 95% of the time, resulting in a highly correlated
Markov chain.
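A comparison of this kind can be sketched by wrapping the sampler in a function and looping over proposal variances. The helper name run_chain is an assumption of this sketch, and the exact acceptance rates and autocorrelations depend on the random seed, so they should resemble rather than reproduce the figures quoted above.

```r
s2 <- 1; t2 <- 10; mu <- 5
y <- c(9.37, 10.18, 9.16, 11.60, 10.33)
log_post <- function(theta) {
  sum(dnorm(y, theta, sqrt(s2), log = TRUE)) + dnorm(theta, mu, sqrt(t2), log = TRUE)
}

# Run one chain and report its acceptance rate and lag-1 autocorrelation.
run_chain <- function(delta2, S = 10000, theta0 = 0) {
  theta <- theta0; THETA <- numeric(S); acs <- 0
  for (s in 1:S) {
    theta_star <- rnorm(1, theta, sqrt(delta2))
    if (log(runif(1)) < log_post(theta_star) - log_post(theta)) {
      theta <- theta_star; acs <- acs + 1
    }
    THETA[s] <- theta
  }
  list(acc = acs / S, lag1 = acf(THETA, lag.max = 1, plot = FALSE)$acf[2])
}

set.seed(1)
delta2s <- c(1/32, 1/2, 2, 32, 64)
res <- sapply(delta2s, function(d2) unlist(run_chain(d2)))
colnames(res) <- c("1/32", "1/2", "2", "32", "64")
round(res, 2)   # rows: acceptance rate and lag-1 autocorrelation
```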

In order to construct a Markov chain with a low correlation we need a
proposal variance that is large enough so that the Markov chain can quickly
move around the parameter space, but not so large that the proposals end up
getting rejected most of the time. Among the proposal variances considered
for the data and normal model here, this balance was optimized with a $\delta^2$ of
2, which gives an acceptance rate of 35%. In general, it is common practice to
first select a proposal distribution by implementing several short runs of the
Metropolis algorithm under different $\delta$-values until one is found that gives an
acceptance rate roughly between 20 and 50%. Once a reasonable value of $\delta$ is
selected, a longer, more efficient Markov chain can be run. Alternatively,
modified versions of the Metropolis algorithm can be constructed that adaptively
change the value of $\delta$ at the beginning of the chain in order to automatically
find a good proposal distribution.
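One simple way such an adaptive scheme might look is sketched below. The specific rule (halve or double the proposal standard deviation after each batch of 50 iterations, depending on whether the batch acceptance rate falls below 20% or above 50%) is an assumption of this sketch, not a method given in the text. Adaptation happens only during the burn-in; the proposal is then frozen so that the retained values come from an ordinary Metropolis chain.

```r
s2 <- 1; t2 <- 10; mu <- 5
y <- c(9.37, 10.18, 9.16, 11.60, 10.33)
log_post <- function(theta) {
  sum(dnorm(y, theta, sqrt(s2), log = TRUE)) + dnorm(theta, mu, sqrt(t2), log = TRUE)
}

# One Metropolis update; delta is the proposal standard deviation.
mh_step <- function(theta, delta) {
  theta_star <- rnorm(1, theta, delta)
  accept <- log(runif(1)) < log_post(theta_star) - log_post(theta)
  list(theta = if (accept) theta_star else theta, accept = accept)
}

set.seed(1)
theta <- 0; delta <- 8             # deliberately poor starting proposal sd
B <- 2000; batch <- 50; acs <- 0   # adaptation settings are assumptions
for (s in 1:B) {
  out <- mh_step(theta, delta); theta <- out$theta; acs <- acs + out$accept
  if (s %% batch == 0) {           # end of a batch: adjust delta, reset the counter
    rate <- acs / batch
    if (rate < 0.20) delta <- delta / 2
    if (rate > 0.50) delta <- delta * 2
    acs <- 0
  }
}

# delta is now fixed; this is the chain that would be used for inference
S <- 5000; acs <- 0; THETA <- numeric(S)
for (s in 1:S) {
  out <- mh_step(theta, delta); theta <- out$theta; acs <- acs + out$accept
  THETA[s] <- theta
}
acc <- acs / S
c(delta = delta, acceptance = acc)
```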


10.3 The Metropolis algorithm for Poisson regression 


Let’s implement the Metropolis algorithm for the Poisson regression model
introduced at the beginning of the chapter. Recall that the model is that $Y_i$ is
a sample from a Poisson distribution with a log-mean given by $\log E[Y_i|x_i] =
\beta_1 + \beta_2 x_i + \beta_3 x_i^2$, where $x_i$ is the age of the sparrow $i$. We will abuse notation
slightly by writing $x_i = (1, x_i, x_i^2)$ so that $\log E[Y_i|x_i] = \beta^T x_i$. The prior
distribution we used in Section 10.1 was that the regression coefficients were
i.i.d. normal(0,100). Given a current value $\beta^{(s)}$ and a value $\beta^*$ generated from
$J(\beta^*|\beta^{(s)})$, the acceptance ratio for the Metropolis algorithm is

$$ r = \frac{p(\beta^*|X, y)}{p(\beta^{(s)}|X, y)} = \frac{\prod_{i=1}^n \text{dpois}(y_i, \exp(x_i^T \beta^*))}{\prod_{i=1}^n \text{dpois}(y_i, \exp(x_i^T \beta^{(s)}))} \times \frac{\prod_{j=1}^3 \text{dnorm}(\beta_j^*, 0, 10)}{\prod_{j=1}^3 \text{dnorm}(\beta_j^{(s)}, 0, 10)} . $$


All that remains to implement the algorithm is to specify the proposal
distribution for $\beta^*$. A convenient choice is a multivariate normal distribution with
mean $\beta^{(s)}$. In many problems, the posterior variance can be an efficient choice
of a proposal variance. Although we do not know the posterior variance
before we run the Metropolis algorithm, it is often sufficient just to use a rough
approximation. In a normal regression problem, the posterior variance of $\beta$
will be close to $\sigma^2 (X^T X)^{-1}$, where $\sigma^2$ is the variance of $Y$. In our Poisson
regression, the model is that the log of $Y$ has expectation equal to $\beta^T x$, so
let’s try a proposal variance of $\hat{\sigma}^2 (X^T X)^{-1}$ where $\hat{\sigma}^2$ is the sample variance of
$\{\log(y_1 + 1/2), \ldots, \log(y_n + 1/2)\}$ (we use $\log(y + 1/2)$ instead of $\log y$ because
the latter would be $-\infty$ if $y = 0$). If this results in an acceptance rate that is
too high or too low, we can always adjust the proposal variance accordingly.

R-code to implement the Metropolis algorithm for a Poisson regression of
y on X is as follows:

# rmvnorm() is not part of base R; a multivariate normal sampler with this
# interface is assumed to be available (for example mvtnorm::rmvnorm)
data(chapter10) ; y<-yX.sparrow[,1] ; X<-yX.sparrow[,-1]
n<-length(y) ; p<-dim(X)[2]

pmn.beta<-rep(0,p)     #prior expectation
psd.beta<-rep(10,p)    #prior standard deviation

var.prop<- var(log(y+1/2))*solve( t(X)%*%X )   #proposal var
S<-10000
beta<-rep(0,p) ; acs<-0
BETA<-matrix(0,nrow=S,ncol=p)
set.seed(1)

for(s in 1:S)
{
  # propose from a multivariate normal centered at the current value
  beta.p<- t(rmvnorm(1, beta, var.prop ))

  # log acceptance ratio: log likelihood and log prior, proposed minus current
  lhr<- sum(dpois(y,exp(X%*%beta.p),log=T)) -
        sum(dpois(y,exp(X%*%beta),log=T)) +
        sum(dnorm(beta.p,pmn.beta,psd.beta,log=T)) -
        sum(dnorm(beta,pmn.beta,psd.beta,log=T))

  if( log(runif(1))< lhr ) { beta<-beta.p ; acs<-acs+1 }
  BETA[s,]<-beta
}
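For readers without the sparrow data at hand, the same algorithm can be tried on simulated data using only base R. Everything specific in this sketch is an assumption made for illustration: the simulated ages, the made-up coefficient vector beta_true, and the use of a Cholesky factor in place of rmvnorm() to draw the multivariate normal proposal.

```r
set.seed(1)
n <- 50
age <- sample(1:6, n, replace = TRUE)
X <- cbind(1, age, age^2)
beta_true <- c(0.3, 0.7, -0.13)            # made-up values, for illustration only
y <- rpois(n, exp(X %*% beta_true))
p <- ncol(X)

# proposal variance from the rough approximation described in the text
var_prop <- var(log(y + 1/2)) * solve(t(X) %*% X)
L <- t(chol(var_prop))                     # lower-triangular, L %*% t(L) = var_prop

# log posterior up to a constant: Poisson likelihood plus normal(0, sd = 10) priors
log_post <- function(b) {
  sum(dpois(y, exp(X %*% b), log = TRUE)) + sum(dnorm(b, 0, 10, log = TRUE))
}

S <- 5000; beta <- rep(0, p); acs <- 0
BETA <- matrix(0, nrow = S, ncol = p)
for (s in 1:S) {
  beta_p <- beta + as.vector(L %*% rnorm(p))   # draw from normal(beta, var_prop)
  if (log(runif(1)) < log_post(beta_p) - log_post(beta)) {
    beta <- beta_p; acs <- acs + 1
  }
  BETA[s, ] <- beta
}
acc <- acs / S
acc                            # acceptance rate
colMeans(BETA[-(1:1000), ])    # posterior means after dropping 1,000 burn-in values
```

If the acceptance rate falls outside a reasonable range, var_prop can be multiplied by a constant and the chain rerun.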


Applying this algorithm to the song sparrow data gives an acceptance
rate of about 43%. A plot of $\beta_3$ versus iteration number appears in the first
panel of Figure 10.5. The algorithm moves quickly from the starting value
of $\beta_3 = 0$ to a region closer to the posterior mode. The second panel of the
figure shows the autocorrelation function for $\beta_3$. We could possibly reduce
the autocorrelation by modifying the proposal variance and obtaining a new
Markov chain, although this Markov chain is perhaps sufficient to obtain a
good approximation to $p(\beta|X, y)$. For example, the third panel of Figure 10.5
plots the autocorrelation function of every 10th value of $\beta_3$ from the Markov
chain. This “thinned” subsequence contains 1,000 of the 10,000 $\beta_3$ values,
but these 1,000 values are nearly independent. This suggests we have nearly
the equivalent of 1,000 independent samples of $\beta_3$ with which to approximate
the posterior distribution. To be more precise, we can calculate the effective
sample size as described in Section 6.6. The effective sample sizes for $\beta_1$,
$\beta_2$ and $\beta_3$ are 818, 778 and 726 respectively. The adequacy of this Markov
chain is confirmed further in the first two panels of Figure 10.6, which plots
the MCMC approximations to the marginal posterior densities of $\beta_2$ and $\beta_3$.
These densities are nearly identical to the ones obtained from the grid-based
approximation, which are shown in gray lines for comparison. Finally, the
third panel of the figure plots posterior quantiles of $E[Y|x]$ for each age $x$,
which indicates the quadratic nature of reproductive success for this song
sparrow population.

Fig. 10.5. Plot of the Markov chain in $\beta_3$ along with autocorrelation functions.
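The effect of thinning on autocorrelation can be seen on any correlated sequence. The sketch below uses a simulated first-order autoregressive series as a stand-in for MCMC output (the sparrow chain itself is not reproduced here), with an assumed autocorrelation parameter phi = 0.8.

```r
set.seed(1)
S <- 10000; phi <- 0.8            # phi is an assumed level of dependence
x <- numeric(S)
for (s in 2:S) x[s] <- phi * x[s - 1] + rnorm(1)

thin <- x[seq(10, S, by = 10)]    # keep every 10th value: 1,000 of the 10,000

lag1 <- function(z) acf(z, lag.max = 1, plot = FALSE)$acf[2]
c(full = lag1(x), thinned = lag1(thin))
```

The full sequence has a lag-1 autocorrelation near phi, while the thinned one is much closer to zero.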


10.4 Metropolis, Metropolis-Hastings and Gibbs 


Recall that a Markov chain is a sequentially generated sequence $\{x^{(1)}, x^{(2)}, \ldots\}$
such that the mechanism that generates $x^{(s+1)}$ can depend on the value of
$x^{(s)}$ but not on $\{x^{(s-1)}, x^{(s-2)}, \ldots, x^{(1)}\}$. A more poetic way of putting this
is that for a Markov chain “the future depends on the present and not on
the past.” The Gibbs sampler and the Metropolis algorithm are both ways of
generating Markov chains that approximate a target probability distribution
$p_0(x)$ for a potentially vector-valued random variable $x$. In Bayesian analysis,
$x$ is typically a parameter or vector of parameters and $p_0(x)$ is a posterior
distribution, but the Gibbs sampler and Metropolis algorithm are both used
more broadly.

Fig. 10.6. The first two panels give the MCMC approximations to the posterior
marginal distributions of $\beta_2$ and $\beta_3$ in black, with the grid-based approximations in
gray. The third panel gives 2.5%, 50% and 97.5% posterior quantiles of $\exp(\beta^T x)$.

In this section we will show that these two algorithms are in fact spe- 
cial cases of a more general algorithm, called the Metropolis-Hastings algo- 
rithm. We will then describe why Markov chains generated by the Metropolis- 
Hastings algorithm are able to approximate a target probability distribution. 
Since the Gibbs and Metropolis algorithms are special cases of Metropolis- 
Hastings, this implies that these two algorithms are also valid ways to ap- 
proximate probability distributions. 


10.4.1 The Metropolis-Hastings algorithm 


We’ll first consider a simple example where our target probability distribution
is $p_0(u, v)$, a bivariate distribution for two random variables $U$ and $V$. In the
one-sample normal problem, for example, we would have $U = \theta$, $V = \sigma^2$ and
$p_0(u, v) = p(\theta, \sigma^2|y)$.

Recall that the Gibbs sampler proceeds by iteratively sampling values of
$U$ and $V$ from their conditional distributions: Given $x^{(s)} = (u^{(s)}, v^{(s)})$, a new
value of $x^{(s+1)}$ is generated as follows:

1. update $U$: sample $u^{(s+1)} \sim p_0(u|v^{(s)})$;

2. update $V$: sample $v^{(s+1)} \sim p_0(v|u^{(s+1)})$.

Alternatively, we could have first sampled $v^{(s+1)} \sim p_0(v|u^{(s)})$ and then
$u^{(s+1)} \sim p_0(u|v^{(s+1)})$.
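As a small illustration of these two updates, the sketch below runs a Gibbs sampler for a target that is not from the text: a bivariate normal with zero means, unit variances and correlation rho = 0.6, for which each full conditional is normal with mean rho times the other variable and variance 1 - rho^2.

```r
set.seed(1)
rho <- 0.6; S <- 5000          # assumed target correlation and chain length
u <- v <- numeric(S)           # chain starts at (0, 0)
for (s in 1:(S - 1)) {
  u[s + 1] <- rnorm(1, rho * v[s],     sqrt(1 - rho^2))  # u ~ p0(u | v^(s))
  v[s + 1] <- rnorm(1, rho * u[s + 1], sqrt(1 - rho^2))  # v ~ p0(v | u^(s+1))
}
cor(u, v)                      # should be close to rho
```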

In contrast, the Metropolis algorithm proposes changes to $X = (U, V)$ and
then accepts or rejects those changes based on $p_0$. In the Poisson regression
example the proposed vector differed from its current value at each element
of the vector, but this is not necessary. An alternative way to implement the
Metropolis algorithm is to propose and then accept or reject changes to one
element at a time:

1. update $U$:
   a) sample $u^* \sim J_u(u|u^{(s)})$;
   b) compute $r = p_0(u^*, v^{(s)})/p_0(u^{(s)}, v^{(s)})$;
   c) set $u^{(s+1)}$ to $u^*$ or $u^{(s)}$ with probability $\min(1, r)$ and $\max(0, 1 - r)$.
2. update $V$:
   a) sample $v^* \sim J_v(v|v^{(s)})$;
   b) compute $r = p_0(u^{(s+1)}, v^*)/p_0(u^{(s+1)}, v^{(s)})$;
   c) set $v^{(s+1)}$ to $v^*$ or $v^{(s)}$ with probability $\min(1, r)$ and $\max(0, 1 - r)$.

Here, $J_u$ and $J_v$ are separate symmetric proposal distributions for $U$ and $V$.

This Metropolis algorithm generates proposals from $J_u$ and $J_v$ and
accepts them with some probability $\min(1, r)$. Similarly, each step of the Gibbs
sampler can be seen as generating a proposal from a full conditional
distribution and then accepting it with probability 1. The Metropolis-Hastings
algorithm generalizes both of these approaches by allowing arbitrary proposal
distributions. The proposal distributions can be symmetric around the current
values, full conditional distributions, or something else entirely. A
Metropolis-Hastings algorithm for approximating $p_0(u, v)$ runs as follows:

1. update $U$:
   a) sample $u^* \sim J_u(u|u^{(s)}, v^{(s)})$;
   b) compute the acceptance ratio

$$ r = \frac{p_0(u^*, v^{(s)})}{p_0(u^{(s)}, v^{(s)})} \times \frac{J_u(u^{(s)}|u^*, v^{(s)})}{J_u(u^*|u^{(s)}, v^{(s)})} ; $$

   c) set $u^{(s+1)}$ to $u^*$ or $u^{(s)}$ with probability $\min(1, r)$ and $\max(0, 1 - r)$.
2. update $V$:
   a) sample $v^* \sim J_v(v|u^{(s+1)}, v^{(s)})$;
   b) compute the acceptance ratio

$$ r = \frac{p_0(u^{(s+1)}, v^*)}{p_0(u^{(s+1)}, v^{(s)})} \times \frac{J_v(v^{(s)}|u^{(s+1)}, v^*)}{J_v(v^*|u^{(s+1)}, v^{(s)})} ; $$

   c) set $v^{(s+1)}$ to $v^*$ or $v^{(s)}$ with probability $\min(1, r)$ and $\max(0, 1 - r)$.

In this algorithm the proposal distributions $J_u$ and $J_v$ are not required to be
symmetric. In fact, the only requirement is that they do not depend on $U$ or $V$
values in our sequence previous to the most current values. This requirement
ensures that the sequence is a Markov chain.

The Metropolis-Hastings algorithm looks a lot like the Metropolis algorithm,
except that the acceptance ratio contains an extra factor, the ratio of
the probability of generating the current value from the proposed to the
probability of generating the proposed from the current. This can be viewed as a
“correction factor:” If a value $u^*$ is much more likely to be proposed than the
current value $u^{(s)}$, then we must down-weight the probability of accepting $u^*$
accordingly, otherwise the value $u^*$ will be overrepresented in our sequence.
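The correction factor can be seen at work with an asymmetric proposal. The sketch below is an illustration under assumed settings, not an example from the text: the target is a gamma(3, 2) density (mean 1.5) and proposals are drawn from a log-normal distribution centered at the log of the current value, which is not symmetric in the current and proposed values.

```r
set.seed(1)
log_p0 <- function(x) dgamma(x, shape = 3, rate = 2, log = TRUE)  # assumed target
delta <- 0.8; S <- 20000                                          # assumed tuning
x <- 1; X <- numeric(S)
for (s in 1:S) {
  x_star <- rlnorm(1, meanlog = log(x), sdlog = delta)   # asymmetric proposal
  log_r <- log_p0(x_star) - log_p0(x) +
    dlnorm(x,      meanlog = log(x_star), sdlog = delta, log = TRUE) -  # J(x | x*)
    dlnorm(x_star, meanlog = log(x),      sdlog = delta, log = TRUE)    # J(x* | x)
  if (log(runif(1)) < log_r) x <- x_star
  X[s] <- x
}
mean(X)   # should be close to the target mean of 1.5
```

For this proposal the correction factor works out to x_star / x, so leaving it out would favor moves to small values and distort the sample.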
That the Metropolis algorithm is a special case of the Metropolis-Hastings
algorithm is easy to see: If $J_u$ is symmetric, meaning that $J_u(u_a|u_b, v) =
J_u(u_b|u_a, v)$ for all possible $u_a$, $u_b$ and $v$, then the correction factor in the
Metropolis-Hastings acceptance ratio is equal to 1 and the acceptance
probability is the same as in the Metropolis algorithm. That the Gibbs sampler is a
type of Metropolis-Hastings algorithm is almost as easy to see. In the Gibbs
sampler the proposal distribution for $U$ is the full conditional distribution of
$U$ given $V = v$. If we use the full conditionals as our proposal distributions
in the Metropolis-Hastings algorithm, we have $J_u(u^*|u^{(s)}, v^{(s)}) = p_0(u^*|v^{(s)})$.
The Metropolis-Hastings acceptance ratio is then

$$
\begin{aligned}
r &= \frac{p_0(u^*, v^{(s)})}{p_0(u^{(s)}, v^{(s)})} \times \frac{J_u(u^{(s)}|u^*, v^{(s)})}{J_u(u^*|u^{(s)}, v^{(s)})} \\
  &= \frac{p_0(u^*, v^{(s)})}{p_0(u^{(s)}, v^{(s)})} \times \frac{p_0(u^{(s)}|v^{(s)})}{p_0(u^*|v^{(s)})} \\
  &= \frac{p_0(u^*|v^{(s)})\,p_0(v^{(s)})}{p_0(u^{(s)}|v^{(s)})\,p_0(v^{(s)})} \times \frac{p_0(u^{(s)}|v^{(s)})}{p_0(u^*|v^{(s)})} \\
  &= \frac{p_0(v^{(s)})}{p_0(v^{(s)})} = 1 ,
\end{aligned}
$$

and so if we propose a value from the full conditional distribution the
acceptance probability is 1, and the algorithm is equivalent to the Gibbs sampler.


10.4.2 Why does the Metropolis-Hastings algorithm work? 


A more general form of the Metropolis-Hastings algorithm is as follows: Given
a current value $x^{(s)}$ of $X$,

1. Generate $x^*$ from $J_s(x^*|x^{(s)})$;
2. Compute the acceptance ratio

$$ r = \frac{p_0(x^*)}{p_0(x^{(s)})} \times \frac{J_s(x^{(s)}|x^*)}{J_s(x^*|x^{(s)})} ; $$

3. Sample $u \sim$ uniform(0, 1). If $u < r$ set $x^{(s+1)} = x^*$, else set $x^{(s+1)} = x^{(s)}$.

Note that the proposal distribution may also depend on the iteration number
$s$. For example, the Metropolis-Hastings algorithm presented in the last
section can be equivalently described by steps 1, 2 and 3 above by setting $J_s$ to
be equal to $J_u$ for odd values of $s$ and equal to $J_v$ for even values. This makes
the algorithm alternately update values of $U$ and $V$.

The primary restriction we place on $J_s(x^*|x^{(s)})$ is that it does not depend
on values in the sequence previous to $x^{(s)}$. This restriction ensures that the
algorithm generates a Markov chain. We also want to choose $J_s$ so that the
Markov chain is able to converge to the target distribution $p_0$. For example,
we want to make sure that every value of $x$ such that $p_0(x) > 0$ will eventually
be proposed (and so accepted some fraction of the time), regardless of where
we start the Markov chain. An example in which this is not the case is where
the values of $X$ having non-zero probability are the integers, and $J_s(x^*|x^{(s)})$
proposes $x^{(s)} \pm 2$ with equal probability. In this case the Metropolis-Hastings
algorithm produces a Markov chain, but the chain will only generate even
numbers if $x^{(1)}$ is even, and only odd numbers if $x^{(1)}$ is odd. This type of
Markov chain is called reducible, as the set of possible $X$-values can be divided
into non-overlapping sets (even and odd integers in this example), between
which the algorithm is unable to move. In contrast, we want our Markov chain
to be irreducible, that is, able to go from any one value of $X$ to any other,
eventually.
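The reducible example is easy to simulate. In the sketch below the target on the integers is a made-up unnormalized distribution, p0(x) proportional to exp(-|x|/3); starting from an even value with proposals of plus or minus 2, the chain never visits an odd integer even though every odd integer has positive target probability.

```r
set.seed(1)
p0 <- function(x) exp(-abs(x) / 3)   # unnormalized target on the integers (made up)
S <- 5000; x <- 0; X <- numeric(S)
for (s in 1:S) {
  x_star <- x + sample(c(-2, 2), 1)  # symmetric proposal: current value +/- 2
  r <- p0(x_star) / p0(x)
  if (runif(1) < r) x <- x_star
  X[s] <- x
}
table(X %% 2)   # every sampled value is even
```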

Additionally, we will want $J_s$ to be such that the Markov chain is aperiodic
and recurrent. A value $x$ is periodic with period $k > 1$ in a Markov chain if it
can only be visited every $k$th iteration. If $x$ is periodic, then for every $S$ there
are an infinite number of iterations $s > S$ for which $\Pr(x^{(s)} = x) = 0$. Since
we want the distribution of $x^{(s)}$ to converge to $p_0$, we should make sure that if
$p_0(x) > 0$, then $x$ is not periodic in our Markov chain. A Markov chain lacking
any periodic states is called aperiodic. Finally, if $x^{(s)} = x$ for some iteration
$s$, then this must mean that $p_0(x) > 0$. Therefore, we want our Markov chain
to be able to return to $x$ from time to time as we run our chain (otherwise the
relative fraction of $x$’s in the chain will go to zero, even though $p_0(x) > 0$). A
value $x$ is said to be recurrent if, when we continue to run the Markov chain
from $x$, we are guaranteed to eventually return to $x$. Clearly we want all of
the possible values of $X$ to be recurrent in our Markov chain.

An irreducible, aperiodic and recurrent Markov chain is a very well
behaved object. A theorem from probability theory says that the empirical
distribution of samples generated from such a Markov chain will converge:

Theorem 2 (Ergodic Theorem) If $\{x^{(1)}, x^{(2)}, \ldots\}$ is an irreducible, aperiodic
and recurrent Markov chain, then there is a unique probability distribution $\pi$
such that as $s \to \infty$,

• $\Pr(x^{(s)} \in A) \to \pi(A)$ for any set $A$;
• $\frac{1}{S} \sum g(x^{(s)}) \to \int g(x)\pi(x)\,dx$.

The distribution $\pi$ is called the stationary distribution of the Markov chain.
It is called the stationary distribution because it has the following property:

If $x^{(s)} \sim \pi$,
and $x^{(s+1)}$ is generated from the Markov chain starting at $x^{(s)}$,
then $\Pr(x^{(s+1)} \in A) = \pi(A)$.

In other words, if you sample $x^{(s)}$ from $\pi$ and then generate $x^{(s+1)}$ conditional
on $x^{(s)}$ from the Markov chain, then the unconditional distribution of $x^{(s+1)}$
is $\pi$. Once you are sampling from the stationary distribution, you are always
sampling from the stationary distribution.
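Both properties can be checked numerically for a small discrete chain. The three-state transition matrix below is made up for illustration (row a gives the probabilities of moving from state a to each state); it is not a Metropolis-Hastings chain, just an irreducible and aperiodic one.

```r
P <- matrix(c(0.5, 0.4, 0.1,
              0.2, 0.6, 0.2,
              0.3, 0.3, 0.4), nrow = 3, byrow = TRUE)   # each row sums to 1

# Distribution of x^(s) after 50 steps, from two different starting states
d1 <- c(1, 0, 0); d3 <- c(0, 0, 1)
for (s in 1:50) { d1 <- d1 %*% P; d3 <- d3 %*% P }
rbind(from_state_1 = as.vector(d1), from_state_3 = as.vector(d3))  # rows agree

pi_hat <- as.vector(d1)
stat_err <- max(abs(as.vector(pi_hat %*% P) - pi_hat))  # stationarity: pi P = pi
stat_err
```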
In most problems it is not too hard to construct Metropolis-Hastings
algorithms that generate Markov chains that are irreducible, aperiodic and
recurrent. For example, if $p_0(x)$ is continuous, then using a normal proposal
distribution centered around the current value guarantees that
$\Pr(x^{(s+1)} \in A|x^{(s)} = x) > 0$ for every $x$, $s$ and set $A$ such that $p_0(A) > 0$. All of the
Metropolis-Hastings algorithms in this book generate Markov chains that are
irreducible, aperiodic and recurrent. As such, sequences of $X$-values generated
from these algorithms can be used to approximate their stationary
distributions. What is left to show is that the stationary distribution $\pi$ for a
Metropolis-Hastings algorithm is equal to the distribution $p_0$ we wish to
approximate.


“Proof” that $\pi(x) = p_0(x)$


The theorem above says that the stationary distribution of the
Metropolis-Hastings algorithm is unique, and so if we show that $p_0$ is a stationary
distribution, we will have shown it is the stationary distribution. Our sketch of
a proof follows closely a proof from Gelman et al (2004) for the Metropolis
algorithm. In that proof and here, it is assumed for simplicity that $X$ is a
discrete random variable. Suppose $x^{(s)}$ is sampled from the target distribution
$p_0$, and then $x^{(s+1)}$ is generated from $x^{(s)}$ using the Metropolis-Hastings
algorithm. To show that $p_0$ is the stationary distribution we need to show
that $\Pr(x^{(s+1)} = x) = p_0(x)$.

Let $x_a$ and $x_b$ be any two values of $X$ such that $p_0(x_a)J_s(x_b|x_a) \geq
p_0(x_b)J_s(x_a|x_b)$. Then under the Metropolis-Hastings algorithm the
probability that $x^{(s)} = x_a$ and $x^{(s+1)} = x_b$ is equal to the probability of

1. sampling $x^{(s)} = x_a$ from $p_0$;
2. proposing $x^* = x_b$ from $J_s(x^*|x^{(s)})$;
3. accepting $x^{(s+1)} = x_b$.

The probability of these three things occurring is their product:

$$
\begin{aligned}
\Pr(x^{(s)} = x_a, x^{(s+1)} = x_b) &= p_0(x_a) \times J_s(x_b|x_a) \times \frac{p_0(x_b)}{p_0(x_a)} \frac{J_s(x_a|x_b)}{J_s(x_b|x_a)} \\
  &= p_0(x_b) J_s(x_a|x_b) .
\end{aligned}
$$

On the other hand, the probability that $x^{(s)} = x_b$ and $x^{(s+1)} = x_a$ is the
probability that $x_b$ is sampled from $p_0$, that $x_a$ is proposed from $J_s(x^*|x^{(s)})$
and that $x_a$ is accepted as $x^{(s+1)}$. But in this case the acceptance probability
is one because we assumed $p_0(x_a)J_s(x_b|x_a) \geq p_0(x_b)J_s(x_a|x_b)$. This means
that $\Pr(x^{(s)} = x_b, x^{(s+1)} = x_a) = p_0(x_b)J_s(x_a|x_b)$.

The above two calculations have shown that the probability of observing
$x^{(s)}$ and $x^{(s+1)}$ to be $x_a$ and $x_b$, respectively, is the same as observing them
to be $x_b$ and $x_a$ respectively, for any two values $x_a$ and $x_b$. The final step of
the proof is to use this fact to derive the marginal probability $\Pr(x^{(s+1)} = x)$:

$$
\begin{aligned}
\Pr(x^{(s+1)} = x) &= \sum_{x_a} \Pr(x^{(s+1)} = x, x^{(s)} = x_a) \\
  &= \sum_{x_a} \Pr(x^{(s+1)} = x_a, x^{(s)} = x) \\
  &= \Pr(x^{(s)} = x) .
\end{aligned}
$$

This completes the proof that $\Pr(x^{(s+1)} = x) = p_0(x)$ if $\Pr(x^{(s)} = x) = p_0(x)$.
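The argument can be checked numerically on a small discrete example. In the sketch below the five-state target p0 and the randomly generated asymmetric proposal matrix J are made up for illustration; the code builds the Metropolis-Hastings transition matrix and confirms that the joint probabilities are symmetric in the two time points and that p0 is stationary.

```r
p0 <- c(0.1, 0.2, 0.4, 0.2, 0.1)          # target on states 1..5 (made up)
K <- length(p0)
set.seed(1)
J <- matrix(runif(K * K), K, K)           # arbitrary asymmetric proposal
J <- J / rowSums(J)                       # J[a, b] = J(x_b | x_a)

P <- matrix(0, K, K)                      # MH transition probabilities
for (a in 1:K) for (b in 1:K) if (a != b) {
  r <- (p0[b] * J[b, a]) / (p0[a] * J[a, b])   # acceptance ratio for a -> b
  P[a, b] <- J[a, b] * min(1, r)
}
diag(P) <- 1 - rowSums(P)                 # probability of staying in place

joint <- p0 * P                           # joint[a, b] = Pr(x^(s) = a, x^(s+1) = b)
sym_err  <- max(abs(joint - t(joint)))              # symmetry of the joint
stat_err <- max(abs(as.vector(p0 %*% P) - p0))      # stationarity of p0
c(sym_err = sym_err, stat_err = stat_err)           # both zero up to rounding
```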


10.5 Combining the Metropolis and Gibbs algorithms 


In complex models it is often the case that conditional distributions are avail- 
able for some parameters but not for others. In these situations we can combine 
Gibbs and Metropolis-type proposal distributions to generate a Markov chain 
to approximate the joint posterior distribution of all of the parameters. In this 
section we do this in the context of estimating the parameters in a regression 
model for time-series data where the errors are temporally correlated. In this 
case, full conditional distributions are available for the regression parameters 
but not the parameter describing the dependence among the observations. 


Example: Historical CO2 and temperature data


Analyses of ice cores from East Antarctica have allowed scientists to deduce 
historical atmospheric conditions of the last few hundred thousand years (Pe- 
tit et al, 1999). The first plot of Figure 10.7 plots time-series of temperature 
and carbon dioxide concentration on a standardized scale (centered and scaled 
to have a mean of zero and a variance of one). The data include 200 values 
of temperature measured at roughly equal time intervals, with time between 
consecutive measurements being approximately 2,000 years. For each value of 
temperature there is a CO2 concentration value corresponding to a date that
is roughly 1,000 years previous to the temperature value, on average.
Temperature is recorded in terms of its difference from present temperature
in degrees Celsius, and CO2 concentration is recorded in parts per million by
volume.

The plot indicates that the temporal histories of temperature and CO2
follow very similar patterns. The second plot in Figure 10.7 indicates that CO2
concentration at a given time point is predictive of temperature following that
time point. One way to quantify this is by fitting a linear regression model for
temperature ($Y$) as a function of CO2 ($x$). Ordinary least squares regression
gives an estimated model of $\hat{E}[Y|x] = -23.02 + 0.08x$ with a nominal standard
error of 0.0038 for the slope term. The validity of this standard error relies
on the error terms in the regression model being independent and identically
distributed, and standard confidence intervals further rely on the errors being
normally distributed.

Fig. 10.7. Temperature and carbon dioxide data.

These two assumptions are examined in the two residual diagnostic plots
in Figure 10.8. The first plot, a histogram of the residuals, indicates no
serious deviation from normality. The second plot gives the autocorrelation
function of the residuals, and indicates a nontrivial correlation of 0.52
between residuals at consecutive time points. Such a positive correlation
generally means that there is less information in the data, and less evidence for a
relationship between the two variables, than is assumed by the ordinary least
squares regression analysis.

Fig. 10.8. Residual analysis for the least squares estimation.
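A residual check of this kind can be sketched on simulated data (the ice core data are not reproduced here). The sample size, regression coefficients and error correlation below are all assumed values chosen for illustration.

```r
set.seed(1)
n <- 200; rho <- 0.5                    # assumed sample size and error correlation
x <- rnorm(n)
e <- numeric(n); e[1] <- rnorm(1)
for (t in 2:n) e[t] <- rho * e[t - 1] + rnorm(1, 0, sqrt(1 - rho^2))  # AR(1) errors
y <- 1 + 2 * x + e                      # made-up regression coefficients

fit <- lm(y ~ x)                        # ordinary least squares fit
r1 <- acf(resid(fit), lag.max = 1, plot = FALSE)$acf[2]
r1                                      # lag-1 autocorrelation of the residuals
```

A value well above zero signals that the nominal OLS standard errors are likely too small.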


10.5.1 A regression model with correlated errors 


The ordinary regression model is

$$ Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \sim \text{multivariate normal}(X\beta, \sigma^2 I). $$

The diagnostic plots suggest that a more appropriate model for the ice core
data is one in which the error terms are not independent, but temporally
correlated. This means we must replace the covariance matrix $\sigma^2 I$ in the
ordinary regression model with a matrix $\Sigma$ that can represent positive correlation
between sequential observations. One simple, popular class of covariance
matrices for temporally correlated data are those having first-order autoregressive
structure:

$$ \Sigma = \sigma^2 C_\rho = \sigma^2 \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{pmatrix} . $$

Under this covariance matrix the variance of $\{Y_i|\beta, x_i\}$ is $\sigma^2$ but the
correlation between $Y_i$ and $Y_{i+t}$ is $\rho^t$, which decreases to zero as the time difference
$t$ becomes larger.
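A correlation matrix with this structure has entries $\rho^{|i-j|}$ and can be built in one line. The function name ar1_cor and the example values of n, rho and sigma2 are assumptions of this sketch.

```r
# First-order autoregressive correlation matrix: entry (i, j) is rho^|i - j|.
ar1_cor <- function(n, rho) rho^abs(outer(1:n, 1:n, "-"))

C <- ar1_cor(4, 0.5)
C                      # C[1, 4] is 0.5^3 = 0.125
sigma2 <- 2            # assumed error variance
Sigma <- sigma2 * C    # covariance matrix of the errors
```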

Having observed $Y = y$, the parameters to estimate in this model include
$\beta$, $\sigma^2$ and $\rho$. Using the multivariate normal and inverse-gamma prior
distributions of Section 9.2.1 for $\beta$ and $\sigma^2$, it is left as an exercise to show
that

$$
\begin{aligned}
\{\beta|X, y, \sigma^2, \rho\} &\sim \text{multivariate normal}(\beta_n, \Sigma_n) , \text{ where} \\
\Sigma_n &= (X^T C_\rho^{-1} X/\sigma^2 + \Sigma_0^{-1})^{-1} \\
\beta_n &= \Sigma_n (X^T C_\rho^{-1} y/\sigma^2 + \Sigma_0^{-1} \beta_0) , \text{ and} \\
\{\sigma^2|X, y, \beta, \rho\} &\sim \text{inverse-gamma}([\nu_0 + n]/2, [\nu_0 \sigma_0^2 + \text{SSR}_\rho]/2) , \text{ where} \\
\text{SSR}_\rho &= (y - X\beta)^T C_\rho^{-1} (y - X\beta).
\end{aligned}
\qquad (10.2)
$$

If $\beta_0 = 0$ and $\Sigma_0$ has large diagonal entries, then $\beta_n$ is very close to
$(X^T C_\rho^{-1} X)^{-1} X^T C_\rho^{-1} y$. If $\rho$ were known this would be the generalized least
squares (GLS) estimate of $\beta$, a type of weighted least squares estimate that is
used when the error terms are not independent and identically distributed. In
such situations, both OLS and GLS provide unbiased estimates of $\beta$ but the
GLS estimate has a lower variance. Bayesian analysis using a model that
accounts for the correlated errors provides parameter estimates that are similar
to those of GLS, so for convenience we will refer to our analysis as “Bayesian
GLS.”
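The closeness of $\beta_n$ to the GLS estimate under a diffuse prior can be checked on simulated data. The data-generating values below (n, rho, the coefficients and the prior variance of 1000) are assumptions made for this sketch.

```r
set.seed(1)
n <- 100; rho <- 0.5; s2 <- 1                 # assumed values, for illustration
X <- cbind(1, rnorm(n))
C <- rho^abs(outer(1:n, 1:n, "-"))            # AR(1) correlation matrix
y <- as.vector(X %*% c(1, 2) + t(chol(C)) %*% rnorm(n))  # errors with covariance C

iC <- solve(C)
beta_gls <- solve(t(X) %*% iC %*% X, t(X) %*% iC %*% y)  # GLS estimate

beta0 <- c(0, 0); iS0 <- diag(1 / 1000, 2)    # diffuse prior: Sigma_0 = diag(1000)
Sigma_n <- solve(t(X) %*% iC %*% X / s2 + iS0)
beta_n  <- Sigma_n %*% (t(X) %*% iC %*% y / s2 + iS0 %*% beta0)

cbind(beta_gls, beta_n)                       # the two columns nearly coincide
```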

If we knew the value of $\rho$ we could use the Gibbs sampler to approximate
$p(\beta, \sigma^2|X, y, \rho)$ by iteratively sampling from the full conditional distributions
given by the equations in 10.2. Of course $\rho$ is unknown and so we will need to
estimate it as well with our Markov chain. Unfortunately the full conditional
distribution for $\rho$ will be nonstandard for most prior distributions, suggesting
that the Gibbs sampler is not applicable here and we may have to use a
Metropolis algorithm (although a discrete approximation to $p(\rho|X, y, \beta, \sigma^2)$
could be used).

It is in situations like this that the generality of the Metropolis-Hastings
algorithm comes in handy. Recall that in this algorithm we are allowed to
use different proposal distributions at each step. We can iteratively update $\beta$,
$\sigma^2$ and $\rho$ at different steps, making proposals with full conditional
distributions for $\beta$ and $\sigma^2$ (Gibbs proposals) and a symmetric proposal distribution
for $\rho$ (a Metropolis proposal). Following the rules of the Metropolis-Hastings
algorithm, we accept with probability 1 any proposal coming from a full
conditional distribution, whereas we have to calculate an acceptance probability
for proposals of $\rho$. Given $\{\beta^{(s)}, \sigma^{2(s)}, \rho^{(s)}\}$, a Metropolis-Hastings algorithm
to generate a new set of parameter values is as follows:


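1. Update $\beta$: Sample $\beta^{(s+1)} \sim$ multivariate normal$(\beta_n, \Sigma_n)$, where $\beta_n$ and
   $\Sigma_n$ depend on $\sigma^{2(s)}$ and $\rho^{(s)}$.

2. Update $\sigma^2$: Sample $\sigma^{2(s+1)} \sim$ inverse-gamma$([\nu_0 + n]/2, [\nu_0\sigma_0^2 + \mathrm{SSR}_\rho]/2)$,
   where $\mathrm{SSR}_\rho$ depends on $\beta^{(s+1)}$ and $\rho^{(s)}$.

3. Update $\rho$:
   a) Propose $\rho^* \sim$ uniform$(\rho^{(s)} - \delta, \rho^{(s)} + \delta)$. If $\rho^* < 0$ then reassign it to
      be $|\rho^*|$. If $\rho^* > 1$ reassign it to be $2 - \rho^*$.

   b) Compute the acceptance ratio

      $$r = \frac{p(y \mid X, \beta^{(s+1)}, \sigma^{2(s+1)}, \rho^*)\, p(\rho^*)}{p(y \mid X, \beta^{(s+1)}, \sigma^{2(s+1)}, \rho^{(s)})\, p(\rho^{(s)})}$$

      and sample $u \sim$ uniform(0,1). If $u < r$ set $\rho^{(s+1)} = \rho^*$, otherwise set
      $\rho^{(s+1)} = \rho^{(s)}$.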


The proposal distribution used in Step 3.a is called a reflecting random walk,
which ensures that $0 < \rho < 1$. It is left as an exercise to show that this
proposal distribution is symmetric. We also leave it as an exercise to show
that the value of $r$ given in Step 3.b is numerically equal to

$$\frac{p(\beta^{(s+1)}, \sigma^{2(s+1)}, \rho^* \mid y, X)}{p(\beta^{(s+1)}, \sigma^{2(s+1)}, \rho^{(s)} \mid y, X)},$$

the ratio as given in the definition of the Metropolis algorithm.
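
The reflecting random walk can be sketched in a few lines of R. The function name `propose_rho` and the values of `rho` and `delta` used below are hypothetical; the sketch also assumes `delta` is smaller than 1, so that a single reflection is enough to bring the proposal back into the unit interval.

```r
# Sketch: reflecting random walk proposal on (0, 1).
# delta is a user-chosen tuning parameter (the value below is hypothetical).
propose_rho <- function(rho, delta) {
  rho_star <- runif(1, rho - delta, rho + delta)
  if (rho_star < 0) rho_star <- abs(rho_star)   # reflect at 0
  if (rho_star > 1) rho_star <- 2 - rho_star    # reflect at 1
  rho_star
}

set.seed(1)
draws <- replicate(10000, propose_rho(0.95, delta = 0.1))
range(draws)   # proposals stay inside the unit interval
```

Because the current value 0.95 is close to the boundary, a sizable share of the raw uniform draws exceed 1 and are folded back, which is exactly the case the reflection rule is meant to handle.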

While technically the steps above constitute “three iterations” of the 
Metropolis-Hastings algorithm, it is convenient to group them together as 
one. A sequence of Metropolis-Hastings steps in which each parameter is up- 
dated is often referred to as a scan of the algorithm. 
10.5.2 Analysis of the ice core data


We'll use diffuse prior distributions for the parameters, with $\beta_0 = 0$, $\Sigma_0 =
\text{diag}(1000)$, $\nu_0 = 1$ and $\sigma_0^2 = 1$. Our prior for $\rho$ will be the uniform
distribution on (0,1). The first panel of Figure 10.9 plots the first 1,000 values
$\{\rho^{(1)}, \ldots, \rho^{(1000)}\}$ generated using the Metropolis-Hastings algorithm above.
The acceptance rate for these values is 0.322 which seems good, but the
autocorrelation of the sequence, shown in the second panel, is very high. The
effective sample size for this correlated sequence of 1,000 $\rho$-values is only 23,
indicating that we will need many more iterations of the algorithm to obtain
a decent approximation to the posterior distribution.

Fig. 10.9. The first 1,000 values of $\rho$ generated from the Markov chain.


Suppose we were to generate 25,000 scans for a total of 25,000 × 4 =
100,000 parameter values. Storing and manipulating all of these values can
be tedious and somewhat unnecessary: Since the Markov chain is so highly
correlated, the values of $\rho^{(s)}$ and $\rho^{(s+1)}$ offer roughly the same information
about the posterior distribution. With this in mind, for highly correlated
Markov chains with moderate to large numbers of parameters we will often
only save a fraction of the scans of the Markov chain. This practice of throwing
away many iterations of a Markov chain is known as thinning. Figure 10.10
shows the thinned output of a 25,000-scan Markov chain for the ice core data,
in which only every 25th scan was saved. Thinning the output reduces it
down to a manageable 1,000 samples, having a much lower autocorrelation
than 1,000 sequential samples from an unthinned Markov chain.
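
To see the effect of thinning in isolation, the following R sketch uses a simulated first-order autoregressive sequence as a stand-in for highly correlated MCMC output. The sequence, its lag-1 correlation `phi` and the helper `lag1` are assumptions made for illustration and are not output from the ice core analysis.

```r
# Sketch: effect of thinning on autocorrelation, using a simulated AR(1)
# sequence in place of real MCMC output.
set.seed(1)
S   <- 25000
phi <- 0.95                          # hypothetical lag-1 autocorrelation
x   <- numeric(S)
x[1] <- rnorm(1)
for (s in 2:S) x[s] <- phi * x[s - 1] + rnorm(1, sd = sqrt(1 - phi^2))

thinned <- x[seq(25, S, by = 25)]    # keep every 25th value: 1,000 values
first   <- x[1:1000]                 # 1,000 sequential values

# Lag-1 sample autocorrelation of a sequence
lag1 <- function(z) acf(z, lag.max = 1, plot = FALSE)$acf[2]
c(sequential = lag1(first), thinned = lag1(thinned))
```

Both vectors hold 1,000 values, but the thinned one spans the whole run, so neighboring saved values are far less alike than neighboring sequential values.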

Fig. 10.10. Every 25th value of $\rho$ from a Markov chain of length 25,000.

The Monte Carlo approximation to the posterior density of $\beta_2$, the slope
parameter, appears in the first panel of Figure 10.11. The posterior mean of $\beta_2$
is 0.028 and a posterior 95% quantile-based confidence interval is (0.01, 0.05),
indicating evidence that the relationship between CO$_2$ and temperature is
positive. However, as indicated in the second plot this relationship seems
much weaker than that suggested by the OLS estimate of 0.08. For the OLS
estimation, the small number of data points with high y-values have a larger
amount of influence on the estimate of $\beta$. In contrast, the GLS model
recognizes that many of these extreme points are highly correlated with one
another and down-weights their influence. We note that this “weaker” regression
coefficient is a result of the temporally correlated data, and not of the
particular prior distribution we used or the Bayesian approach in general. The
reader is encouraged to repeat the analysis with different prior distributions,
or to perform a non-Bayesian GLS estimation for comparison. In any case,
the data analysis indicates evidence of a relationship between temperature
measurements and the CO$_2$ measurements that precede them in time.


10.6 Discussion and further references 


The Metropolis algorithm was introduced by Metropolis et al (1953) in an 
application to a problem in statistical physics. The algorithm was general- 
ized by Hastings (1970), but it was not until the publication of Gelfand and 
Smith (1990) that MCMC became widely used in the statistics community. 
See Robert and Casella (2008) for a brief history of Monte Carlo and MCMC 
methods. 

Fig. 10.11. Posterior distribution of the slope parameter $\beta_2$, along with the
posterior mean regression line.

A number of modifications and extensions of MCMC methods have appeared
since the 1990s. One technique that is broadly applicable is automatic,
adaptive tuning of the proposal distribution in order to achieve good mixing
(Gilks et al, 1998; Haario et al, 2001). Not all adaptive algorithms will result
in chains that converge to the target distribution, but there are known conditions
under which convergence is guaranteed (Atchadé and Rosenthal, 2005;
Roberts and Rosenthal, 2007).


11 


Linear and generalized linear mixed effects models


In Chapter 8 we learned about the concept of hierarchical modeling, a data 
analysis approach that is appropriate when we have multiple measurements 
within each of several groups. In that chapter, variation in the data was rep- 
resented with a between-group sampling model for group-specific means, in 
addition to a within-group sampling model to represent heterogeneity of ob- 
servations within a group. In this chapter we extend the hierarchical model 
to describe how relationships between variables may differ between groups. 
This can be done with a regression model to describe within-group variation, 
and a multivariate normal model to describe heterogeneity among regression 
coefficients across the groups. We also cover estimation for hierarchical gen- 
eralized linear models, which are hierarchical models that have a generalized 
linear regression model representing within-group heterogeneity. 


11.1 A hierarchical regression model 


Let’s return to the math score data described in Section 8.4, which included 
math scores of 10th grade children from 100 different large urban public high 
schools. In Chapter 8 we estimated school-specific expected math scores, as 
well as how these expected values varied from school to school. Now sup- 
pose we are interested in examining the relationship between math score 
and another variable, socioeconomic status (SES), which was calculated from 
parental income and education levels for each student in the dataset. 

In Chapter 8 we quantified the between-school heterogeneity in expected 
math score with a hierarchical model. Given the amount of variation we ob- 
served it seems possible that the relationship between math score and SES 
might vary from school to school as well. A quick and easy way to assess this 
possibility is to fit a linear regression model of math score as a function of 
SES for each of the 100 schools in the dataset. To make the parameters more 
interpretable we will center the SES scores within each school separately, so 
that the sample average SES score within each school is zero. As a result, the 
intercept of the regression line can be interpreted as the school-level average
math score.

Fig. 11.1. Least squares regression lines for the math score data, and plots of
estimates versus group sample size.


The first panel of Figure 11.1 plots least squares estimates of the regression 
lines for the 100 schools, along with an average of these lines in black. A 
large majority show an increase in expected math score with increasing SES, 
although a few show a negative relationship. The second and third panels of 
the figure relate the least squares estimates to sample size. Notice that schools 
with the highest sample sizes have regression coefficients that are generally 
close to the average, whereas schools with extreme coefficients are generally 
those with low sample sizes. This phenomenon is reminiscent of what we 
discussed in Section 8.4: The smaller the sample size for the group, the more 
probable that unrepresentative data are sampled and an extreme least squares 
estimate is produced. As in Chapter 8, our remedy to this problem will be 
to stabilize the estimates for small sample size schools by sharing information 
across groups, using a hierarchical model. 
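
The group-by-group least squares fits described above can be sketched in R as follows. The data are simulated, and the number of groups, the group sizes and the names `ses`, `score` and `BETA_ols` are assumptions for illustration rather than the math score data.

```r
# Sketch: separate least squares fits within each group, with the
# regressor centered within group. All data here are simulated.
set.seed(1)
m  <- 20                                   # hypothetical number of schools
nj <- sample(5:30, m, replace = TRUE)      # hypothetical group sample sizes
group <- rep(1:m, times = nj)
ses   <- rnorm(sum(nj))
score <- 50 + 2 * ses + rnorm(sum(nj), sd = 9)

# Center SES within each group so each group's sample mean is zero
ses_c <- ses - ave(ses, group)

# One (intercept, slope) pair per group
BETA_ols <- t(sapply(1:m, function(j) {
  coef(lm(score[group == j] ~ ses_c[group == j]))
}))
colnames(BETA_ols) <- c("intercept", "slope")

# With a centered regressor, each intercept equals the group mean score
all.equal(unname(BETA_ols[, "intercept"]),
          unname(as.vector(tapply(score, group, mean))))
```

Plotting the columns of `BETA_ols` against `nj` gives the kind of comparison between estimates and sample size that the second and third panels of Figure 11.1 show.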

The hierarchical model in the linear regression setting is a conceptually 
straightforward generalization of the normal hierarchical model from Chapter 
8. We use an ordinary regression model to describe within-group heterogeneity 
of observations, then describe between-group heterogeneity using a sampling 
model for the group-specific regression parameters. Expressed symbolically, 
our within-group sampling model is 


$$Y_{i,j} = \beta_j^T x_{i,j} + \epsilon_{i,j}, \quad \{\epsilon_{i,j}\} \sim \text{i.i.d. normal}(0, \sigma^2), \qquad (11.1)$$

where $x_{i,j}$ is a $p \times 1$ vector of regressors for observation $i$ in group $j$. Expressing
$Y_{1,j}, \ldots, Y_{n_j,j}$ as a vector $Y_j$ and combining $x_{1,j}, \ldots, x_{n_j,j}$ into an $n_j \times p$
matrix $X_j$, the within-group sampling model can be expressed equivalently
as $Y_j \sim$ multivariate normal$(X_j\beta_j, \sigma^2 I)$, with the group-specific data vectors
$Y_1, \ldots, Y_m$ being conditionally independent given $\beta_1, \ldots, \beta_m$ and $\sigma^2$.
The heterogeneity among the regression coefficients $\beta_1, \ldots, \beta_m$ will be
described with a between-group sampling model. If we have no prior information
distinguishing the different groups we can model them as being exchangeable,
or (roughly) equivalently, as being i.i.d. from some distribution representing
the sampling variability across groups. The normal hierarchical regression
model describes the across-group heterogeneity with a multivariate normal
model, so that

$$\beta_1, \ldots, \beta_m \sim \text{i.i.d. multivariate normal}(\theta, \Sigma). \qquad (11.2)$$


A graphical representation of the hierarchical model appears in Figure 11.2,
which makes clear that the multivariate normal distribution for $\beta_1, \ldots, \beta_m$
is not a prior distribution representing uncertainty about a fixed but unknown
quantity. Rather, it is a sampling distribution representing heterogeneity
among a collection of objects. The values of $\theta$ and $\Sigma$ are fixed but
unknown parameters to be estimated.


Fig. 11.2. A graphical representation of the hierarchical normal regression model.


This hierarchical regression model is sometimes called a linear mixed effects
model. This name is motivated by an alternative parameterization of
Equations 11.1 and 11.2. We can rewrite the between-group sampling model
as

$$\beta_j = \theta + \gamma_j$$
$$\gamma_1, \ldots, \gamma_m \sim \text{i.i.d. multivariate normal}(0, \Sigma).$$

Plugging this into our within-group regression model gives

$$Y_{i,j} = \beta_j^T x_{i,j} + \epsilon_{i,j}$$
$$\phantom{Y_{i,j}} = \theta^T x_{i,j} + \gamma_j^T x_{i,j} + \epsilon_{i,j}.$$

In this parameterization $\theta$ is referred to as a fixed effect as it is constant
across groups, whereas $\gamma_1, \ldots, \gamma_m$ are called random effects, as they vary.
The name “mixed effects model” comes from the fact that the regression
model contains both fixed and random effects. Although for our particular
example the regressors corresponding to the fixed and random effects are the
same, this does not have to be the case. A more general model would be
$Y_{i,j} = \theta^T x_{i,j} + \gamma_j^T z_{i,j} + \epsilon_{i,j}$, where $x_{i,j}$ and $z_{i,j}$ could be vectors of different
lengths which may or may not contain overlapping variables. In particular,
$x_{i,j}$ might contain regressors that are group specific, that is, constant across
all observations in the same group. Such variables are not generally included
in $z_{i,j}$, as there would be no information in the data with which to estimate
the corresponding group-specific regression coefficients.

Given a prior distribution for $(\theta, \Sigma, \sigma^2)$ and having observed $Y_1 =
y_1, \ldots, Y_m = y_m$, a Bayesian analysis proceeds by computing the posterior
distribution $p(\beta_1, \ldots, \beta_m, \theta, \Sigma, \sigma^2 \mid X_1, \ldots, X_m, y_1, \ldots, y_m)$. If semiconjugate
prior distributions are used for $\theta$, $\Sigma$ and $\sigma^2$, then the posterior distribution
can be approximated quite easily with Gibbs sampling. The classes of
semiconjugate prior distributions for $\theta$ and $\Sigma$ are as in the multivariate normal
model discussed in Chapter 7. The prior we will use for $\sigma^2$ is the usual
inverse-gamma distribution.

$$\theta \sim \text{multivariate normal}(\mu_0, \Lambda_0)$$
$$\Sigma \sim \text{inverse-Wishart}(\eta_0, S_0^{-1})$$
$$\sigma^2 \sim \text{inverse-gamma}(\nu_0/2, \nu_0\sigma_0^2/2)$$


11.2 Full conditional distributions 


While computing the posterior distribution for so many parameters may seem 
daunting, the calculations involved in computing the full conditional distri- 
butions have the same mathematical structure as models we have studied 
in previous chapters. Once we have the full conditional distributions we can 
iteratively sample from them to approximate the joint posterior distribution. 


Full conditional distributions of $\beta_1, \ldots, \beta_m$


Our hierarchical regression model shares information across groups via the
parameters $\theta$, $\Sigma$ and $\sigma^2$. As a result, conditional on $\theta$, $\Sigma$, $\sigma^2$ the regression
coefficients $\beta_1, \ldots, \beta_m$ are independent. Referring to the graph in Figure 11.2,
from the perspective of a given $\beta_j$ the model looks like an ordinary one-group
regression problem where the prior mean and variance for $\beta_j$ are $\theta$ and $\Sigma$.
This analogy is in fact correct, and the results of Section 9.2.1 show that
$\{\beta_j \mid y_j, X_j, \theta, \Sigma, \sigma^2\}$ has a multivariate normal distribution with

$$\text{Var}[\beta_j \mid y_j, X_j, \sigma^2, \theta, \Sigma] = (\Sigma^{-1} + X_j^T X_j/\sigma^2)^{-1}$$
$$\text{E}[\beta_j \mid y_j, X_j, \sigma^2, \theta, \Sigma] = (\Sigma^{-1} + X_j^T X_j/\sigma^2)^{-1} (\Sigma^{-1}\theta + X_j^T y_j/\sigma^2).$$
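
A minimal R sketch of a single draw from this full conditional is given below. The function name `sample_beta_j` and all input values are hypothetical placeholders, and the multivariate normal draw is built from a Cholesky factor so that only base R is needed.

```r
# Sketch: one draw of beta_j from its multivariate normal full conditional.
# Base R only; the inputs below are hypothetical placeholder values.
sample_beta_j <- function(y, X, theta, Sigma, sigma2) {
  iSigma <- solve(Sigma)
  V <- solve(iSigma + crossprod(X) / sigma2)                 # Var[beta_j | ...]
  E <- V %*% (iSigma %*% theta + crossprod(X, y) / sigma2)   # E[beta_j | ...]
  # Draw from N(E, V): E + L z, where L L^T = V and z is standard normal
  as.vector(E + t(chol(V)) %*% rnorm(length(theta)))
}

set.seed(1)
X_j   <- cbind(1, rnorm(15))                 # 15 observations, p = 2
y_j   <- X_j %*% c(48, 2.5) + rnorm(15, sd = 9)
theta <- c(50, 2)
Sigma <- diag(c(25, 4))
beta_j <- sample_beta_j(y_j, X_j, theta, Sigma, sigma2 = 81)
beta_j
```

In a Gibbs sampler this function would be called once per group in each scan, with `theta`, `Sigma` and `sigma2` set to their current values.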
Full conditional distributions of $\theta$ and $\Sigma$


Our sampling model for the $\beta_j$’s is that they are i.i.d. samples from a
multivariate normal population with mean $\theta$ and variance $\Sigma$. Inference for the
population mean and variance of a multivariate normal population was covered
in Chapter 7, in which we derived the full conditional distributions when
semiconjugate priors are used. There, we saw that the full conditional distribution
of a population mean is multivariate normal with expectation equal to
a combination of the prior expectation and the sample mean, and precision
equal to the sum of the prior and data precisions. In the context of the
hierarchical regression model, given $\Sigma$ and our sample of regression coefficients
$\beta_1, \ldots, \beta_m$, the full conditional distribution of $\theta$ is as follows:

$$\{\theta \mid \beta_1, \ldots, \beta_m, \Sigma\} \sim \text{multivariate normal}(\mu_m, \Lambda_m), \text{ where}$$
$$\Lambda_m = (\Lambda_0^{-1} + m\Sigma^{-1})^{-1}$$
$$\mu_m = \Lambda_m(\Lambda_0^{-1}\mu_0 + m\Sigma^{-1}\bar{\beta}),$$

where $\bar{\beta}$ is the vector average $\frac{1}{m}\sum_{j=1}^m \beta_j$. In Chapter 7 we also saw that the full
conditional distribution of a covariance matrix is an inverse-Wishart distribution,
with sum of squares matrix equal to the prior sum of squares $S_0$ plus
the sum of squares from the sample:

$$\{\Sigma \mid \theta, \beta_1, \ldots, \beta_m\} \sim \text{inverse-Wishart}(\eta_0 + m, [S_0 + S_\theta]^{-1}), \text{ where}$$
$$S_\theta = \sum_{j=1}^m (\beta_j - \theta)(\beta_j - \theta)^T.$$

Note that $S_\theta$ depends on $\theta$ and so must be recomputed each time $\theta$ is updated
in the Markov chain.


Full conditional distribution of $\sigma^2$


The parameter $\sigma^2$ represents the error variance, assumed to be common across
all groups. As such, conditional on $\beta_1, \ldots, \beta_m$, the data provide information
about $\sigma^2$ via the sum of squared residuals from each group:

$$\sigma^2 \sim \text{inverse-gamma}([\nu_0 + \textstyle\sum_j n_j]/2,\ [\nu_0\sigma_0^2 + \mathrm{SSR}]/2), \text{ where}$$
$$\mathrm{SSR} = \sum_{j=1}^m \sum_{i=1}^{n_j} (y_{i,j} - \beta_j^T x_{i,j})^2.$$

It is important to remember that SSR depends on the value of $\beta_j$, and so SSR
must be recomputed in each scan of the Gibbs sampler before $\sigma^2$ is updated.
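
The updates for $\theta$, $\Sigma$ and $\sigma^2$ can be sketched in base R as below. The function names, the helper `rmvn` and all numerical inputs are hypothetical; the inverse-Wishart draw is obtained by inverting a Wishart draw from `rWishart`, which is one common way to do it and is an assumption of this sketch rather than the text's own code.

```r
# Sketch: full conditional updates for theta, Sigma and sigma^2 given
# current values of beta_1, ..., beta_m (rows of BETA). Base R only.
rmvn <- function(mu, V) as.vector(mu + t(chol(V)) %*% rnorm(length(mu)))

update_theta <- function(BETA, Sigma, mu0, L0) {
  m <- nrow(BETA); iSigma <- solve(Sigma); iL0 <- solve(L0)
  Lm  <- solve(iL0 + m * iSigma)
  mum <- Lm %*% (iL0 %*% mu0 + iSigma %*% colSums(BETA))  # colSums = m * beta-bar
  rmvn(mum, Lm)
}

update_Sigma <- function(BETA, theta, eta0, S0) {
  R <- sweep(BETA, 2, theta)          # rows are beta_j - theta
  Stheta <- crossprod(R)              # sum_j (beta_j - theta)(beta_j - theta)^T
  solve(rWishart(1, eta0 + nrow(BETA), solve(S0 + Stheta))[, , 1])
}

update_sigma2 <- function(ylist, Xlist, BETA, nu0, s20) {
  SSR <- sum(sapply(seq_along(ylist), function(j)
    sum((ylist[[j]] - Xlist[[j]] %*% BETA[j, ])^2)))
  N <- sum(sapply(ylist, length))
  1 / rgamma(1, (nu0 + N) / 2, (nu0 * s20 + SSR) / 2)
}

# Hypothetical current state and simulated data
set.seed(1)
m <- 10; p <- 2
BETA  <- cbind(rnorm(m, 50, 5), rnorm(m, 2, 1))
Xlist <- lapply(1:m, function(j) cbind(1, rnorm(12)))
ylist <- lapply(1:m, function(j)
  as.vector(Xlist[[j]] %*% BETA[j, ] + rnorm(12, sd = 9)))
theta  <- update_theta(BETA, diag(c(25, 1)), mu0 = c(50, 2), L0 = diag(c(100, 10)))
Sigma  <- update_Sigma(BETA, theta, eta0 = p + 2, S0 = diag(c(25, 1)))
sigma2 <- update_sigma2(ylist, Xlist, BETA, nu0 = 1, s20 = 81)
```

Note that `update_Sigma` recomputes the sum of squares from the current `theta`, and `update_sigma2` recomputes SSR from the current `BETA`, matching the two reminders in the text.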
11.3 Posterior analysis of the math score data


To analyze the math score data we will use a prior distribution that is similar
in spirit to the unit information priors that were discussed in Chapter 9. For
example, we'll take $\mu_0$, the prior expectation of $\theta$, to be equal to the average of
the ordinary least squares regression estimates and the prior variance $\Lambda_0$ to be
their sample covariance. Such a prior distribution represents the information
of someone with unbiased but weak prior information. For example, a 95%
prior confidence interval for the slope parameter $\theta_2$ under this prior is
(-3.86, 8.60), which is quite a large range when considering what the extremes
of the interval imply in terms of average change in score per unit change in
SES score. Similarly, we will take the prior sum of squares matrix $S_0$ to be
equal to the covariance of the least squares estimate, but we’ll take the prior
degrees of freedom $\eta_0$ to be $p + 2 = 4$, so that the prior distribution of $\Sigma$
is reasonably diffuse but has an expectation equal to the sample covariance
of the least squares estimates. Finally, we’ll take $\sigma_0^2$ to be the average of the
within-group sample variance but set $\nu_0 = 1$.

Fig. 11.3. Relationship between SES and math score. The first panel plots the
posterior density of the expected slope $\theta_2$ of a randomly sampled school, as well
as the posterior predictive distribution of a randomly sampled slope. The second
panel gives posterior expectations of the 100 school-specific regression lines, with
the average line given in black.


Running a Gibbs sampler for 10,000 scans and saving every 10th scan
produces a sequence of 1,000 values for each parameter, each sequence having
a fairly low autocorrelation. For example, the lag-10 autocorrelations of $\theta_1$ and
$\theta_2$ are -0.024 and 0.038. As usual, we can use these simulated values to make
Monte Carlo approximations to various posterior quantities of interest. For
example, the first plot in Figure 11.3 shows the posterior distribution of $\theta_2$,
the expected within-school slope parameter. A 95% quantile-based posterior
confidence interval for this parameter is (1.83, 2.96), which, compared to our
prior interval of (-3.86, 8.60), indicates a strong alteration in our information
about $\theta_2$.

The fact that $\theta_2$ is extremely unlikely to be negative only indicates that the
population average of school-level slopes is positive. It does not indicate that
any given within-school slope cannot be negative. To clarify this distinction,
the posterior predictive distribution of $\tilde{\beta}_2$, the slope for a to-be-sampled school,
is plotted in the same figure. Samples from this distribution can be generated
by sampling a value $\tilde{\beta}^{(s)}$ from a multivariate normal$(\theta^{(s)}, \Sigma^{(s)})$ distribution
for each scan $s$ of the Gibbs sampler. Notice that this posterior predictive
distribution is much more spread out than the posterior distribution of $\theta_2$,
reflecting the heterogeneity in slopes across schools. Using the Monte Carlo
approximation, we have $\Pr(\tilde{\beta}_2 < 0 \mid y_1, \ldots, y_m, X_1, \ldots, X_m) \approx 0.07$, which is
small but not negligible.
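
The posterior predictive sampling step can be sketched in R as follows. The objects `THETA_post` and `SIGMA_post` stand in for saved Gibbs output; the values generated for them here are arbitrary placeholders, not the posterior from the math score analysis, so the printed numbers illustrate the mechanics only.

```r
# Sketch: posterior predictive draws of a new group's coefficients.
# THETA_post (S x p) and SIGMA_post (p x p x S) stand in for saved Gibbs
# output; the values generated here are placeholders, not real output.
set.seed(1)
S <- 1000; p <- 2
THETA_post <- cbind(rnorm(S, 48, 0.5), rnorm(S, 2.4, 0.3))
SIGMA_post <- rWishart(S, 20, diag(c(1, 0.1)))   # arbitrary positive definite draws

BETA_pred <- t(sapply(1:S, function(s) {
  L <- t(chol(SIGMA_post[, , s]))
  # beta-tilde^(s) ~ multivariate normal(theta^(s), Sigma^(s))
  THETA_post[s, ] + as.vector(L %*% rnorm(p))
}))

# Monte Carlo approximation to Pr(new slope < 0) under these placeholders
mean(BETA_pred[, 2] < 0)
# The predictive draws are more spread out than the draws of theta_2
c(sd_theta2 = sd(THETA_post[, 2]), sd_pred = sd(BETA_pred[, 2]))
```

One predictive draw is generated per saved scan, so the resulting sample reflects both the uncertainty in $\theta$ and $\Sigma$ and the group-to-group variability described by $\Sigma$.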

The second panel in Figure 11.3 plots posterior expectations of the 100
school-specific regression lines, with the line given by the posterior mean of $\theta$
in black. Comparing this to the first panel of Figure 11.1 indicates how the
hierarchical model is able to share information across groups, shrinking extreme
regression lines towards the across-group average. In particular, hardly
any of the slopes are negative when we share information across groups.


11.4 Generalized linear mixed effects models 


As the name suggests, a generalized linear mixed effects model combines as- 
pects of linear mixed effects models with those of generalized linear models 
described in Chapter 10. Such models are useful when we have a hierarchical 
data structure but the normal model for the within-group variation is not 
appropriate. For example, if the variable Y were binary or a count, then more 
appropriate models for within-group variation would be logistic or Poisson 
regression models, respectively. 
A basic generalized linear mixed model is as follows:

$$\beta_1, \ldots, \beta_m \sim \text{i.i.d. multivariate normal}(\theta, \Sigma)$$
$$p(y_j \mid X_j, \beta_j, \gamma) = \prod_{i=1}^{n_j} p(y_{i,j} \mid \beta_j^T x_{i,j}, \gamma),$$

with observations from different groups also being conditionally independent.
In this formulation $p(y \mid \beta^T x, \gamma)$ is a density whose mean depends on $\beta^T x$, and
$\gamma$ is an additional parameter often representing variance or scale. For example,
in the normal model $p(y \mid \beta^T x, \gamma) = \text{dnorm}(y, \beta^T x, \gamma^{1/2})$ where $\gamma$ represents
the variance. In the Poisson model $p(y \mid \beta^T x) = \text{dpois}(y, \exp\{\beta^T x\})$, and there
is no $\gamma$ parameter.
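
The within-group density is a product over observations, which is convenient to compute on the log scale. The R sketch below writes this out for the normal and Poisson cases; the function names and the toy inputs are hypothetical.

```r
# Sketch: log p(y_j | X_j, beta_j, gamma) as a sum over observations in group j.
loglik_normal <- function(y, X, beta, gamma) {
  # gamma is the variance, so the standard deviation is sqrt(gamma)
  sum(dnorm(y, mean = X %*% beta, sd = sqrt(gamma), log = TRUE))
}
loglik_poisson <- function(y, X, beta) {
  # log link: the Poisson mean is exp(beta^T x); there is no gamma parameter
  sum(dpois(y, lambda = exp(X %*% beta), log = TRUE))
}

# Hypothetical toy inputs
X_j    <- cbind(1, c(0.1, 0.5, 0.9))
beta_j <- c(0.5, 1)
y_cnt  <- c(1, 3, 4)
loglik_poisson(y_cnt, X_j, beta_j)
loglik_normal(c(0.4, 1.2, 1.3), X_j, beta_j, gamma = 0.25)
```

Working with sums of log densities rather than products of densities avoids numerical underflow when a group has many observations.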
11.4.1 A Metropolis-Gibbs algorithm for posterior approximation


Estimation for the linear mixed effects model was straightforward because
the full conditional distribution of each parameter was standard, allowing for
the easy implementation of a Gibbs sampling algorithm. In contrast, for
non-normal generalized linear mixed models, typically only $\theta$ and $\Sigma$ have standard
full conditional distributions. This suggests we use a Metropolis-Hastings
algorithm to approximate the posterior distribution of the parameters, using
a combination of Gibbs steps for updating $(\theta, \Sigma)$ with a Metropolis step for
updating each $\beta_j$. In what follows we assume there is no $\gamma$ parameter. If there
is such a parameter, it can be updated using a Gibbs step if a full conditional
distribution is available, and a Metropolis step if not.


Gibbs steps for $(\theta, \Sigma)$


Just as in the linear mixed effects model, the full conditional distributions of $\theta$
and $\Sigma$ depend only on $\beta_1, \ldots, \beta_m$. This means that the form of $p(y \mid \beta^T x)$ has
no effect on the full conditional distributions of $\theta$ and $\Sigma$. Whether $p(y \mid \beta^T x)$
is a normal model, a Poisson model, or some other generalized linear model,
the full conditional distributions of $\theta$ and $\Sigma$ will be the multivariate normal
and inverse-Wishart distributions described in Section 11.2.


Metropolis step for $\beta_j$


Updating $\beta_j$ in a Markov chain can proceed by proposing a new value of $\beta_j$
based on the current parameter values and then accepting or rejecting it with
the appropriate probability. A standard proposal distribution in this situation
would be a multivariate normal distribution with mean equal to the current
value $\beta_j^{(s)}$ and with some proposal variance $V_j^{(s)}$. In this case the Metropolis
step is as follows:


1. Sample $\beta_j^* \sim$ multivariate normal$(\beta_j^{(s)}, V_j^{(s)})$.
2. Compute the acceptance ratio

   $$r = \frac{p(y_j \mid X_j, \beta_j^*)\, p(\beta_j^* \mid \theta^{(s)}, \Sigma^{(s)})}{p(y_j \mid X_j, \beta_j^{(s)})\, p(\beta_j^{(s)} \mid \theta^{(s)}, \Sigma^{(s)})}.$$

3. Sample $u \sim$ uniform(0,1). Set $\beta_j^{(s+1)}$ to $\beta_j^*$ if $u < r$ and to $\beta_j^{(s)}$ if $u > r$.


In many cases, setting $V_j^{(s)}$ equal to a scaled version of $\Sigma^{(s)}$ produces a
well-mixing Markov chain, although the task of finding the right scale might have
to proceed by trial and error.
A Metropolis-Hastings approximation algorithm


Putting these steps together results in the following Metropolis-Hastings
algorithm for approximating $p(\beta_1, \ldots, \beta_m, \theta, \Sigma \mid X_1, \ldots, X_m, y_1, \ldots, y_m)$: Given
current values at scan $s$ of the Markov chain, we obtain new values as follows:


1. Sample $\theta^{(s+1)}$ from its full conditional distribution.
2. Sample $\Sigma^{(s+1)}$ from its full conditional distribution.
3. For each $j \in \{1, \ldots, m\}$,
   a) propose a new value $\beta_j^*$;
   b) set $\beta_j^{(s+1)}$ equal to $\beta_j^*$ or $\beta_j^{(s)}$ with the appropriate probability.


11.4.2 Analysis of tumor location data 


(From Haigis et al (2004)) A certain population of laboratory mice experiences 
a high rate of intestinal tumor growth. One item of interest to researchers 
is how the rate of tumor growth varies along the length of the intestine. To 
study this, the intestine of each of 21 sample mice was divided into 20 sections 
and the number of tumors occurring in each section was recorded. The first 
panel of Figure 11.4 shows 21 lines, each one representing the observed tumor 
counts of each of the mice plotted against the fraction of the length along their 
intestine. Although it is hard to tell from the figure, the lines for some mice 
are consistently below the average (given in black), and others are consistently 
above. This suggests that tumor counts are more similar within a mouse than 
between mice, and a hierarchical model with mouse-specific effects may be 
appropriate. 

A natural model for count data such as these is a Poisson distribution
with a log-link. Letting $Y_{x,j}$ be mouse $j$’s tumor count at location $x$ of their
intestine, we will model $Y_{x,j}$ as Poisson$(e^{f_j(x)})$, where $f_j$ is a smoothly varying
function of $x \in [0,1]$. A simple way to parameterize $f_j$ is as a polynomial, so
that $f_j(x) = \beta_{1,j} + \beta_{2,j}x + \beta_{3,j}x^2 + \cdots + \beta_{p,j}x^{p-1}$ for some maximum degree
$p - 1$. Such a parameterization allows us to represent each $f_j$ as a regression
on $(1, x, x^2, \ldots, x^{p-1})$.

Averaging across the 21 mice gives an observed average tumor count $\bar{y}_x$
at each of the 20 locations $x \in (.05, .10, \ldots, .95)$ along the intestine. This
average curve is plotted in the first panel of Figure 11.4 in black, and the log
of this curve is given in the second panel of the figure. Also in the second
panel are approximations of this curve using polynomials of degree 2, 3 and
4. The second- and third-degree approximations indicate substantial lack-of-fit,
whereas the fourth-degree polynomial fits the log average tumor count
function rather well. For simplicity we’ll model each $f_j$ as a fourth-degree
polynomial, although it is possible that a particular $f_j$ may be better fit with
a higher degree.

Fig. 11.4. Tumor count data. The first panel gives mouse-specific tumor counts as
a function of location in gray, with a population average in black. The second panel
gives quadratic, cubic and quartic polynomial fits to the log sample average tumor
count.

Our between-group sampling model for the $\beta_j$’s will be as in the previous
section, so that $\beta_1, \ldots, \beta_m \sim$ i.i.d. multivariate normal$(\theta, \Sigma)$. Unconditional
on $\beta_j$, the observations coming from a given mouse are statistically dependent
as determined by $\Sigma$. Estimating $\Sigma$ in this mixed effects model allows us
to account for and describe potential within-mouse dependencies in the data.
The unknown parameters in this model are $\theta$ and $\Sigma$ for which we need to
specify prior distributions. Using conjugate normal and inverse-Wishart prior
distributions, we need to specify $\mu_0$ and $\Lambda_0$ for $p(\theta)$ and $\eta_0$ and $S_0$ for $p(\Sigma)$.
Specifying reasonable values for this many parameters can be difficult, especially
in the absence of explicit prior data. As an alternative, we’ll take an
approach based on unit information priors, in which the prior distributions
for the parameters are weakly centered around estimates derived from the
observed data. As mentioned before, such prior distributions might represent the
prior information of someone with a small amount of unbiased information.
Our unit information prior requires estimates of $\theta$ and $\Sigma$, the population
mean and covariance of the $\beta_j$’s. For each mouse we can obtain a preliminary
ad hoc estimate $\hat{\beta}_j$ by regressing $\{\log(y_{1,j} + 1/n), \ldots, \log(y_{n,j} + 1/n)\}$
on $\{\mathbf{x}_1, \ldots, \mathbf{x}_{20}\}$, where $\mathbf{x}_i = (1, x_i, x_i^2, x_i^3, x_i^4)^T$ for $x_i \in (.05, .10, \ldots, .95)$
(alternatively, we could obtain $\hat{\beta}_j$ using maximum likelihood estimates from
a Poisson regression model). A unit-information type of prior distribution
for $\theta$ would be a multivariate normal distribution with an expectation of
$\mu_0 = \frac{1}{m}\sum_{j=1}^m \hat{\beta}_j$ and a prior covariance matrix equal to the sample covariance
of the $\hat{\beta}_j$’s. We also set $S_0$ equal to this sample covariance matrix, but set
$\eta_0 = p + 2 = 7$, so that the prior expectation of $\Sigma$ is equal to $S_0$ but the prior
distribution is relatively diffuse.
In terms of MCMC posterior approximation, recall from the steps outlined
in the previous subsection that values of $\theta$ and $\Sigma$ can be sampled from
their full conditional distributions. The full conditional distributions of the
$\beta_j$’s, however, are not standard and so we'll propose changes to these parameters
from distributions centered around their current values. After a bit of
trial and error, it turns out that a multivariate normal$(\beta_j^{(s)}, \Sigma^{(s)}/2)$ proposal
distribution yields an acceptance rate of about 31% and a reasonably
well-mixing Markov chain. Running the Markov chain for 50,000 scans and saving
the values every 10th scan gives 5,000 approximate posterior samples for each
parameter. The effective sample sizes for the elements of $\Sigma$ are all above 1,000
except for that of $\Sigma_{11}$, which was about 950. The effective sample sizes for the
five $\theta$ parameters are (674, 1003, 1092, 1129, 1145). This roughly means that
our Monte Carlo standard error in approximating $\text{E}[\theta_1 \mid y_1, \ldots, y_m, X]$, for
example, is $\sqrt{674} = 25.96$ times smaller than the posterior standard deviation of $\theta_1$.
Of course, we can reduce the Monte Carlo standard error to be as small as we
want by running the Markov chain for more iterations. R-code for generating
this Markov chain appears below:


## rmvnorm, rwish and ldmvnorm are helper functions that are not part of
## base R; they are assumed to be loaded separately, as is the dataset.

## data
data(chapter11)
Y<-XY.tumor$Y ; X<-XY.tumor$X ; m<-dim(Y)[1] ; p<-dim(X)[2]

## priors: ad hoc per-mouse regression estimates
BETA<-NULL
for(j in 1:m)
{
  BETA<-rbind(BETA,lm(log(Y[j,]+1/20)~-1+X[,,j])$coef)
}

mu0<-apply(BETA,2,mean)
S0<-cov(BETA) ; eta0<-p+2
iL0<-iSigma<-solve(S0)

## MCMC
THETA.post<-NULL ; set.seed(1)
for(s in 1:50000)
{
  ##update theta
  Lm<-solve( iL0 + m*iSigma )
  mum<-Lm%*%( iL0%*%mu0 + iSigma%*%apply(BETA,2,sum) )
  theta<-t(rmvnorm(1,mum,Lm))
  ##

  ##update Sigma
  mtheta<-matrix(theta,m,p,byrow=TRUE)
  iSigma<-rwish(1, eta0+m,
           solve( S0+t(BETA-mtheta)%*%(BETA-mtheta)) )
  ##

  ##update beta
  Sigma<-solve(iSigma) ; dSigma<-det(Sigma)
  for(j in 1:m)
  {
    beta.p<-t(rmvnorm(1,BETA[j,],.5*Sigma))

    ## log acceptance ratio: log likelihood ratio plus log prior ratio
    lr<-sum( dpois(Y[j,],exp(X[,,j]%*%beta.p),log=TRUE ) -
             dpois(Y[j,],exp(X[,,j]%*%BETA[j,]),log=TRUE ) ) +
        ldmvnorm( t(beta.p),theta,Sigma,
                  iSigma=iSigma,dSigma=dSigma ) -
        ldmvnorm( t(BETA[j,]),theta,Sigma,
                  iSigma=iSigma,dSigma=dSigma )

    if( log(runif(1))<lr ) { BETA[j,]<-beta.p }
  }
  ##

  ##store some output (every 10th scan)
  if(s%%10==0){THETA.post<-rbind(THETA.post,t(theta))}
  ##
}
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Fig. 11.5. 2.5, 50 and 97.5% quantiles for exp(07 a), exp(@7 a) and {Y |z}. 


The three panels in Figure 11.5 show posterior distributions for a variety
of quantities. The first panel gives 2.5%, 50% and 97.5% posterior quantiles
for $\exp(\theta^T x)$. The second panel gives the same quantiles for the
posterior predictive distribution of $\exp(\beta^T x)$. The difference in the
width of the confidence bands is due to the estimated across-mouse
heterogeneity. If there were no across-mouse heterogeneity then $\Sigma$ would
be zero, each $\beta_j$ would be equal to $\theta$ and the plots in the first
two panels of the figure would be identical. Finally, the third panel gives
2.5%, 50% and 97.5% quantiles of the posterior predictive distribution of
$\{Y|x\}$ for each of the 20 values of $x$. The difference between this plot
and the one in the second panel is due to the variability of a Poisson random
variable $Y$ around its expected value $\exp(\beta^T x)$. The widening
confidence bands of the three plots in this figure describe cumulative sources
of uncertainty: The first panel shows the uncertainty in the fixed but unknown
value of $\theta$. The second panel shows this uncertainty in addition to the
uncertainty due to across-mouse heterogeneity. Finally, the third panel
includes both of these sources of uncertainty as well as that due to
fluctuations of a mouse’s observed tumor counts around its own expected tumor
count function. Understanding these different sources of uncertainty can be
very relevant to inference and decision making: For example, if we want to
predict the observed tumor count distribution of a new mouse, we should use
the confidence bands in the third panel, whereas the bands in the first panel
would be appropriate if we just wanted to describe the uncertainty in the fixed
value of $\theta$.
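The three layers of uncertainty can be illustrated with a small simulation. The following is only a sketch: the draws of $\theta$ are stand-ins generated for illustration (a real analysis would use the saved MCMC output), $\Sigma$ is held fixed rather than sampled, and the covariate vector and all numerical values are hypothetical.

```r
set.seed(1)
S <- 1000                      # number of posterior draws (hypothetical)
x <- c(1, 5)                   # hypothetical covariate vector
# Stand-in posterior draws of theta; a real analysis would use THETA.post
THETA <- cbind(rnorm(S, 0.5, 0.05), rnorm(S, 0.1, 0.01))
# Across-group covariance, held fixed here for simplicity; a real analysis
# would use the value of Sigma sampled at each scan
Sigma <- diag(c(0.04, 0.001))

# Panel 1: uncertainty in theta only
m.theta <- c(exp(THETA %*% x))
# Panel 2: add across-group heterogeneity, beta ~ mvnormal(theta, Sigma)
E <- matrix(rnorm(2 * S), S, 2) %*% chol(Sigma)
m.beta <- c(exp((THETA + E) %*% x))
# Panel 3: add Poisson sampling variability around exp(beta^T x)
y.new <- rpois(S, m.beta)

probs <- c(0.025, 0.5, 0.975)
q1 <- quantile(m.theta, probs)
q2 <- quantile(m.beta, probs)
q3 <- quantile(y.new, probs)
rbind(q1, q2, q3)
```

Each row of the result corresponds to one panel, and the intervals widen from the first row to the third as each additional source of variability is included.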


11.5 Discussion and further references 


Posterior approximation via MCMC for hierarchical models can suffer from 
poor mixing. One reason for this is that many of the parameters in the model 
are highly correlated, and generating them one at a time in the Gibbs sampler 
can lead to a high degree of autocorrelation. For example, $\theta$ and the $\beta_j$’s are
positively correlated, and so an extreme value of $\theta$ at one iteration can lead
to extreme values of the $\beta_j$’s when they get updated, especially if the amount
of within-group data is low. This in turn leads to an extreme value of $\theta$ at the
next iteration. Section 15.4 of Gelman et al. (2004) provides a detailed discussion
of several alternative Gibbs sampling strategies for improving mixing for
hierarchical models. Improvements also can be made by careful reparameterizations
of the model (Gelfand et al., 1995; Papaspiliopoulos et al., 2007).


12 


Latent variable methods for ordinal data 


Many datasets include variables whose distributions cannot be represented 
by the normal, binomial or Poisson distributions we have studied thus far. 
For example, distributions of common survey variables such as age, education 
level and income generally cannot be accurately described by any of the above- 
mentioned sampling models. Additionally, such variables are often binned into 
ordered categories, the number of which may vary from survey to survey. In 
such situations, interest often lies not in the scale of each individual variable, 
but rather in the associations between the variables: Is the relationship be- 
tween two variables positive, negative or zero? What happens if we “account” 
for a third variable? For normally distributed data these types of questions 
can be addressed with the multivariate normal and linear regression models 
of Chapters 7 and 9. In this chapter we extend these models to situations 
where the data are not normal, by expressing non-normal random variables 
as functions of unobserved, “latent” normally distributed random variables. 
Multivariate normal and linear regression models then can be applied to the 
latent data. 


12.1 Ordered probit regression and the rank likelihood 


Suppose we are interested in describing the relationship between the edu- 
cational attainment and number of children of individuals in a population. 
Additionally, we might suspect that an individual’s educational attainment
may be influenced by their parents’ education level. The 1994 General Social
Survey provides data on variables DEG, CHILD and PDEG for a sample of
individuals in the United States, where $\text{DEG}_i$ indicates the highest degree
obtained by individual $i$, $\text{CHILD}_i$ is their number of children and $\text{PDEG}_i$ is
the binary indicator of whether or not either parent of $i$ obtained a college
degree. Using these data, we might be tempted to investigate the relationship 
between the variables with a linear regression model: 


P.D. Hoff, A First Course in Bayesian Statistical Methods, 
Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_12, 
© Springer Science+Business Media, LLC 2009 
$$\text{DEG}_i = \beta_1 + \beta_2 \times \text{CHILD}_i + \beta_3 \times \text{PDEG}_i + \beta_4 \times \text{CHILD}_i \times \text{PDEG}_i + \epsilon_i,$$


where we assume that $\epsilon_1, \ldots, \epsilon_n \sim$ i.i.d. normal$(0, \sigma^2)$. However, such a model
would be inappropriate for a couple of reasons. Empirical distributions of DEG
and CHILD for a sample of 1,002 males in the 1994 workforce are shown in
Figure 12.1. The value of DEG is recorded as taking a value in $\{1,2,3,4,5\}$
corresponding to the highest degree of the respondent being no degree, high
school degree, associate’s degree, bachelor’s degree, or graduate degree.
[Figure 12.1: two panels showing the empirical distributions of DEG (values 1 to 5) and CHILD (values 0 to 8).]


Fig. 12.1. Two ordinal variables having non-normal distributions. 


Since the variable DEG takes on only a small set of discrete values, the 
normality assumption of the residuals will certainly be violated. But perhaps 
more importantly, the regression model imposes a numerical scale to the data 
that is not really present: A bachelor’s degree is not “twice as much” as a 
high school degree, and an associate’s degree is not “two less” than a graduate 
degree. There is an order to the categories in the sense that a graduate degree 
is “higher” than a bachelor’s degree, but otherwise the scale of DEG is not 
meaningful. 

Variables for which there is a logical ordering of the sample space are 
known as ordinal variables. With this definition, the discrete variables DEG 
and CHILD are ordinal variables, as are “continuous” variables like height or 
weight. However, CHILD, height and weight are variables that are measured 
on meaningful numerical scales, whereas DEG is not. In this chapter we will 
use the term “ordinal” to refer to any variable for which there is a logical 
ordering of the sample space. We will use the term “numeric” to refer to 
variables that have meaningful numerical scales, and “continuous” if a variable 
can have a value that is (roughly) any real number in an interval. For example, 
DEG is ordinal but not numeric, whereas CHILD is ordinal, numeric and
discrete. Variables like height or weight are ordinal, numeric and continuous. 


12.1.1 Probit regression 


Linear or generalized linear regression models, which assume a numeric scale 
to the data, may be appropriate for variables like CHILD, height or weight, 
but are not appropriate for non-numeric ordinal variables like DEG. However, 
it is natural to think of many ordinal, non-numeric variables as arising from 
some underlying numeric process. For example, the severity of a disease might 
be described as “low”, “moderate” or “high”, although we imagine a patient’s
actual condition lies within a continuum. Similarly, the amount of effort a 
person puts into formal education may lie within a continuum, but a survey 
may only record a rough, categorized version of this variable, such as DEG. 
This idea motivates a modeling technique known as ordered probit regression, 
in which we relate a variable Y to a vector of predictors x via a regression in 
terms of a latent variable Z. More precisely, the model is 


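$$\epsilon_1, \ldots, \epsilon_n \sim \text{i.i.d. normal}(0, 1)$$

$$Z_i = \beta^T x_i + \epsilon_i \qquad (12.1)$$

$$Y_i = g(Z_i), \qquad (12.2)$$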


where $\beta$ and $g$ are unknown parameters. For example, to model the conditional
distribution of DEG given CHILD and PDEG we would let $Y_i$ be $\text{DEG}_i$
and let $x_i = (\text{CHILD}_i, \text{PDEG}_i, \text{CHILD}_i \times \text{PDEG}_i)$. The regression coefficients
$\beta$ describe the relationship between the explanatory variables and the unobserved
latent variable $Z$, and the function $g$ relates the value of $Z$ to the
observed variable $Y$. The function $g$ is taken to be non-decreasing, so that
we can interpret small and large values of $Z$ as corresponding to small and
large values of $Y$. This also means that the sign of a regression coefficient $\beta_j$
indicates whether $Y$ is increasing or decreasing in $x_j$.

Notice that in this probit regression model we have taken the variance
of $\epsilon_1, \ldots, \epsilon_n$ to be one. This is because the scale of the distribution of $Y$ can
already be represented by g, as g is allowed to be any non-decreasing function. 
Similarly, g can represent the location of the distribution of Y, and so we do 
not need to include an intercept term in the model. 

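If the sample space for $Y$ takes on $K$ values, say $\{1, \ldots, K\}$, then the
function $g$ can be described with only $K - 1$ ordered parameters
$g_1 < g_2 < \cdots < g_{K-1}$ as follows:

$$
y = g(z) =
\begin{cases}
1 & \text{if } -\infty = g_0 < z < g_1 \\
2 & \text{if } g_1 < z < g_2 \\
\vdots & \\
K & \text{if } g_{K-1} < z < g_K = \infty.
\end{cases}
\qquad (12.3)
$$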
The values $\{g_1, \ldots, g_{K-1}\}$ can be thought of as “thresholds,” so that moving
$z$ past a threshold moves $y$ into the next highest category. The unknown
parameters in the model include the regression coefficients $\beta$ and the
thresholds $g_1, \ldots, g_{K-1}$. If we use normal prior distributions for these
quantities, the joint posterior distribution of $\{\beta, g_1, \ldots, g_{K-1}, Z_1, \ldots, Z_n\}$ given
$Y = y = (y_1, \ldots, y_n)$ can be approximated using a Gibbs sampler.
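As a small illustration of Equation 12.3, the threshold function can be written in one line of R using findInterval, which counts how many thresholds lie at or below each value of z. The threshold values and the name g.fun below are hypothetical choices made only for this sketch.

```r
# thresholds g_1 < ... < g_{K-1}; these values are hypothetical (K = 4)
g <- c(-1, 0, 1.5)

# y = k when g_{k-1} < z < g_k, with g_0 = -Inf and g_K = Inf
g.fun <- function(z, g) findInterval(z, g) + 1

z <- c(-2, -0.5, 0.3, 2)
y <- g.fun(z, g)
y
```

A value of z exactly equal to a threshold would be placed in the higher category, but this happens with probability zero for a continuous latent variable.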


Full conditional distribution of $\beta$


Given $Y = y$, $Z = z$, and $g = (g_1, \ldots, g_{K-1})$, the full conditional
distribution of $\beta$ depends only on $z$ and satisfies $p(\beta|y, z, g) \propto p(\beta)p(z|\beta)$.
Just as in ordinary regression, a multivariate normal prior distribution for
$\beta$ gives a multivariate normal posterior distribution. For example, if we use
$\beta \sim$ multivariate normal$(0, n(X^TX)^{-1})$, then $p(\beta|z)$ is multivariate normal
with

$$\text{Var}[\beta|z] = \frac{n}{n+1}(X^TX)^{-1}, \quad \text{and}$$

$$\text{E}[\beta|z] = \frac{n}{n+1}(X^TX)^{-1}X^Tz.$$
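A draw from this full conditional distribution can be generated with a Cholesky factorization. The following is a minimal sketch in which the design matrix X and the current latent values z are simulated stand-ins, and the object names V, E and cholV are our own.

```r
set.seed(1)
n <- 50
X <- cbind(rnorm(n), rbinom(n, 1, 0.5))   # hypothetical predictors
z <- rnorm(n)                             # hypothetical current value of Z

# posterior variance and mean under beta ~ mvnormal(0, n (X'X)^{-1})
V <- solve(t(X) %*% X) * n / (n + 1)
E <- V %*% t(X) %*% z

# chol(V) is upper triangular with t(chol(V)) %*% chol(V) = V,
# so E + t(chol(V)) %*% e has covariance V when e ~ mvnormal(0, I)
cholV <- chol(V)
beta <- c(E + t(cholV) %*% rnorm(ncol(X)))
beta
```

Since V does not depend on z, it and its Cholesky factor can be computed once before the Gibbs sampler starts.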


Full conditional distribution of Z 


The full conditional distributions of the $Z_i$’s are only slightly more complicated.
Under the sampling model, the conditional distribution of $Z_i$ given $\beta$
is $Z_i \sim$ normal$(\beta^T x_i, 1)$. Given $g$, observing $Y_i = y_i$ tells us that $Z_i$ must lie
in the interval $(g_{y_i - 1}, g_{y_i})$. Letting $a = g_{y_i - 1}$ and $b = g_{y_i}$, the full conditional
distribution of $Z_i$ given $\{\beta, y, g\}$ is then

$$p(z_i|\beta, y, g) \propto \text{dnorm}(z_i, \beta^T x_i, 1) \times \delta_{(a,b)}(z_i).$$


This is the density of a constrained normal distribution. To sample a value $z$
from a normal$(\mu, \sigma^2)$ distribution constrained to the interval $(a, b)$, we perform
the following two steps:

1. sample $u \sim$ uniform$(\Phi[(a - \mu)/\sigma], \Phi[(b - \mu)/\sigma])$
2. set $z = \mu + \sigma\Phi^{-1}(u)$

where $\Phi$ and $\Phi^{-1}$ are the cdf and inverse-cdf of the standard normal distribution
(given by pnorm and qnorm in R). Code to sample from the full
conditional distribution of $Z_i$ is as follows:


ez<- t(beta)%*%X[i,]                 # conditional mean of Z_i
a<-max(-Inf,g[y[i]-1],na.rm=TRUE)    # lower bound; -Inf when y[i]=1
b<-min(g[y[i]],Inf,na.rm=TRUE)       # upper bound; Inf when y[i]=K

u<-runif(1, pnorm(a-ez),pnorm(b-ez) )
z[i]<- ez + qnorm(u)


The added complexity in assigning a and b in the above code is to deal with
the special cases $g_0 = -\infty$ and $g_K = \infty$.
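The two-step procedure can be wrapped in a small reusable function. This is a sketch only; the name rcnorm is our own and is not a function provided by R.

```r
# Sample n values from a normal(mu, sigma^2) distribution constrained to (a, b)
# using the inverse-cdf method. "rcnorm" is a hypothetical helper name.
rcnorm <- function(n, mu, sigma, a, b) {
  u <- runif(n, pnorm((a - mu) / sigma), pnorm((b - mu) / sigma))
  mu + sigma * qnorm(u)
}

set.seed(1)
x <- rcnorm(1000, mu = 0, sigma = 1, a = 0.5, b = 2)
range(x)
```

One caution with this approach is that when the interval lies far out in a tail of the distribution, both pnorm values can round to 0 or 1 and the method may lose accuracy, so it is best suited to intervals that are not too extreme relative to the mean and standard deviation.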
Full conditional distribution of g


Suppose the prior distribution for $g$ is some arbitrary density $p(g)$. Given
$Y = y$ and $Z = z$, we know from Equation 12.3 that $g_k$ must be higher
than all $z_i$’s for which $y_i = k$ and lower than all $z_i$’s for which $y_i = k + 1$.
Letting $a_k = \max\{z_i : y_i = k\}$ and $b_k = \min\{z_i : y_i = k + 1\}$, the full
conditional distribution of $g$ is then proportional to $p(g)$ but constrained to
the set $\{g : a_k < g_k < b_k\}$. For example, if $p(g)$ is proportional to the product
$\prod_{k=1}^{K-1}$ dnorm$(g_k, \mu_k, \sigma_k)$ but constrained so that $g_1 < \cdots < g_{K-1}$, then the
full conditional density of $g_k$ is a normal$(\mu_k, \sigma_k^2)$ density constrained to the
interval $(a_k, b_k)$. R-code to sample from the full conditional distribution of $g_k$
is given below:

a<-max(z[y==k])      # largest latent value in category k
b<-min(z[y==k+1])    # smallest latent value in category k+1

u<-runif(1,pnorm((a-mu[k])/sig[k]),pnorm((b-mu[k])/sig[k]))
g[k]<- mu[k] + sig[k]*qnorm(u)


Example: Educational attainment 


Some researchers suggest that having children reduces opportunities for
educational attainment (Moore and Waite, 1977). Here we examine this hypothesis
in a sample of males in the labor force (meaning not retired, not
in school and not in an institution), obtained from the 1994 General Social
Survey. For 959 of the 1,002 survey respondents we have complete data on
the variables DEG, CHILD and PDEG described above. Letting $Y_i = \text{DEG}_i$
and $x_i = (\text{CHILD}_i, \text{PDEG}_i, \text{CHILD}_i \times \text{PDEG}_i)$, we will estimate the parameters
in the ordered probit regression model using prior distributions of $\beta \sim$
multivariate normal$(0, n(X^TX)^{-1})$ and $p(g) \propto \prod_{k=1}^{K-1}$ dnorm$(g_k, 0, 100)$ but
constrained so that $g_1 < \cdots < g_{K-1}$. We’ll approximate the corresponding
posterior distribution of $\{\beta, Z, g\}$ with a Gibbs sampler consisting of 25,000
scans. Saving parameter values every 25th scan results in 1,000 values for each
parameter with which to approximate the posterior distribution. The posterior
mean regression line for people without a college-educated parent ($x_{i,2} = 0$)
is E$[Z|y, x_1, x_2 = 0] = -0.024 \times x_1$, while the regression line for people with a
college-educated parent is E$[Z|y, x_1, x_2 = 1] = 0.818 + 0.054 \times x_1$. These lines
are shown in the first panel of Figure 12.2, along with the value of $Z$ that
was obtained in the last scan of the Gibbs sampler. The lines suggest that
for people whose parents did not go to college, the number of children they
have is indeed weakly negatively associated with their educational outcome.
However, the opposite seems to be true among people whose parents went to
college. The posterior distribution of $\beta_3$ is given in the second panel of the
figure, along with the prior distribution for comparison. The 95% quantile-based
posterior confidence interval for $\beta_3$ is (-0.026, 0.178), which contains zero but
still represents a reasonable amount of evidence that the slope for the $x_2 = 1$
group is larger than that of the $x_2 = 0$ group.

Fig. 12.2. Results from the probit regression analysis.
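The three full conditional distributions can be assembled into a complete sampler. The following is a minimal sketch run on simulated data rather than the GSS data, which are not reproduced here. The sample size, true coefficients, thresholds, starting values and number of scans are all hypothetical choices made for illustration.

```r
set.seed(1)
# simulated data (hypothetical): n cases, 2 predictors, K = 3 categories
n <- 200; K <- 3
X <- cbind(rnorm(n), rnorm(n))
z.true <- X %*% c(1, -0.5) + rnorm(n)
y <- findInterval(z.true, c(-0.5, 0.8)) + 1

# priors: beta ~ mvnormal(0, n (X'X)^{-1}); g_k ~ dnorm(g_k, mu_k, sig_k), ordered
mu <- rep(0, K - 1); sig <- rep(100, K - 1)
V <- solve(t(X) %*% X) * n / (n + 1)
cholV <- chol(V)

# starting values: z set to normal scores, which respect the ordering of y
beta <- rep(0, ncol(X))
z <- qnorm(rank(y, ties.method = "random") / (n + 1))
g <- rep(NA, K - 1)

S <- 500
BETA <- matrix(NA, S, ncol(X))
for (s in 1:S) {
  # update the thresholds g
  for (k in 1:(K - 1)) {
    a <- max(z[y == k]); b <- min(z[y == k + 1])
    u <- runif(1, pnorm((a - mu[k]) / sig[k]), pnorm((b - mu[k]) / sig[k]))
    g[k] <- mu[k] + sig[k] * qnorm(u)
  }
  # update beta
  E <- V %*% (t(X) %*% z)
  beta <- c(E + t(cholV) %*% rnorm(ncol(X)))
  # update z; given beta and g the z_i's are independent, so we vectorize
  ez <- c(X %*% beta)
  a <- c(-Inf, g)[y]; b <- c(g, Inf)[y]
  u <- runif(n, pnorm(a - ez), pnorm(b - ez))
  z <- ez + qnorm(u)
  BETA[s, ] <- beta
}
colMeans(BETA[-(1:100), ])
```

The vectorized update of z replaces the loop over i used in the code above; this is valid because, conditional on $\beta$ and $g$, the $Z_i$’s are independent of one another.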


12.1.2 Transformation models and the rank likelihood 


The analysis of the educational attainment data above required us to specify a
prior distribution for $\beta$ and the transformation $g(z)$, as specified by the vector
$g$ of $K - 1$ threshold parameters. While simple default prior distributions for
$\beta$ exist (such as Zellner’s g-prior), the same is not true for $g$. Coming up with
a prior distribution for g that represents actual prior information seems like a 
difficult task. Of course, this task is much harder if the number of categories 
K is large. For example, the incomes (INC) of the subjects in the 1994 GSS 
dataset were each recorded as belonging to one of 21 ordered categories, so that 
a regression in which Y; = INC; would require that g includes 20 parameters. 
Estimation and prior specification for such a large number of parameters can 
be difficult. 

Fortunately there is an alternative approach to estimating $\beta$ that does not
require us to estimate the function $g(z)$. Note that if the $Z_i$’s were observed
directly, then we could ignore Equation (12.2) of the model and we would
be left with an ordinary regression problem without having to estimate the
transformation $g(z)$. Unfortunately we do not observe the $Z_i$’s directly, but
there is information in the data about the $Z_i$’s that does not require us to
specify $g(z)$: Since we know that $g$ is non-decreasing, we do know something
about the order of the $Z_i$’s. For example, if our observed data are such that
$y_1 > y_2$, then since $y_i = g(Z_i)$, we know that $g(Z_1) > g(Z_2)$. Since $g$ is
non-decreasing, this means that we know $Z_1 > Z_2$. In other words, having
observed $Y = y$, we know that the $Z_i$’s must lie in the set

$$R(y) = \{z \in \mathbb{R}^n : z_{i_1} < z_{i_2} \text{ if } y_{i_1} < y_{i_2}\}.$$


Since the distribution of the $Z_i$’s does not depend on $g$, the probability that
$Z \in R(y)$ for a given $y$ also does not depend on the unknown function $g$. This
suggests that we base our posterior inference on the knowledge that $Z \in R(y)$.
Our posterior distribution for $\beta$ in this case is given by

$$
\begin{aligned}
p(\beta|Z \in R(y)) &\propto p(\beta) \times \Pr(Z \in R(y)|\beta) \\
&= p(\beta) \times \int_{R(y)} \prod_{i=1}^{n} \text{dnorm}(z_i, \beta^T x_i, 1) \, dz_i.
\end{aligned}
$$

As a function of $\beta$, the probability $\Pr(Z \in R(y)|\beta)$ is known as the rank
likelihood. For continuous y-variables this likelihood was introduced by Pettitt
(1982) and its theoretical properties were studied by Bickel and Ritov (1997).
It is called a rank likelihood because for continuous data it contains the same
information about $y$ as knowing the ranks of $\{y_1, \ldots, y_n\}$, i.e. which one has
the highest value, which one has the second highest value, etc. If $Y$ is discrete
then observing $R(y)$ is not exactly the same as knowing the ranks, but for
simplicity we will still refer to $\Pr(Z \in R(y)|\beta)$ as the rank likelihood, whether
$Y$ is discrete or continuous. The important thing to note is that for any
ordinal outcome variable $Y$ (non-numeric, numeric, discrete or continuous),
information about $\beta$ can be obtained from $\Pr(Z \in R(y)|\beta)$ without having
to specify $g(z)$.

For any given $\beta$ the value of $\Pr(Z \in R(y)|\beta)$ involves a very complicated
integral that is difficult to compute. However, by estimating $Z$ simultaneously
with $\beta$ we can obtain an estimate of $\beta$ without ever having to numerically
compute $\Pr(Z \in R(y)|\beta)$. The joint posterior distribution of $\{\beta, Z\}$ can be
approximated by using Gibbs sampling, alternately sampling from full conditional
distributions. The full conditional distribution of $\beta$ is very easy: Given
a current value $z$ of $Z$, the full conditional density $p(\beta|Z = z, Z \in R(y))$ reduces
to $p(\beta|Z = z)$ because knowing the value of $Z$ is more informative than
knowing just that $Z$ lies in the set $R(y)$. A multivariate normal prior distribution
for $\beta$ then results in a multivariate normal full conditional distribution,
as before. The full conditional distributions of the $Z_i$’s are also straightforward
to derive. Let’s consider the full conditional distribution of $Z_i$ given
$\{\beta, Z \in R(y), z_{-i}\}$, where $z_{-i}$ denotes the values of all of the $Z$’s except $Z_i$.
Conditional on $\beta$, $Z_i$ is normal$(\beta^T x_i, 1)$. Conditional on $\{\beta, Z \in R(y), z_{-i}\}$,
the density of $Z_i$ is proportional to a normal density but constrained by the
fact that $Z \in R(y)$. Let’s recall the nature of this constraint: $y_i < y_j$ implies
$Z_i < Z_j$, and $y_i > y_j$ implies $Z_i > Z_j$. This means that $Z_i$ must lie in the
following interval:

$$\max\{z_j : y_j < y_i\} < Z_i < \min\{z_j : y_i < y_j\}.$$

Letting $a$ and $b$ denote the numerical values of the lower and upper endpoints
of this interval, the full conditional distribution of $Z_i$ is then

$$p(z_i|\beta, Z \in R(y), z_{-i}) \propto \text{dnorm}(z_i, \beta^T x_i, 1) \times \delta_{(a,b)}(z_i).$$


This full conditional distribution is exactly the same as that of $Z_i$ in the ordered
probit model, except that now the constraints on $Z_i$ are determined
directly by the current value of $Z_{-i}$, instead of by the threshold variables. As
such, sampling from this full conditional distribution is very similar to sampling
from the analogous distribution in the probit regression model:


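ez<- t(beta)%*%X[i,]     # conditional mean of Z_i
a<-max(z[y<y[i]])        # largest z among cases with smaller y
b<-min(z[y[i]<y])        # smallest z among cases with larger y

u<-runif(1, pnorm(a-ez),pnorm(b-ez) )
z[i]<- ez + qnorm(u)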
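To see that this update keeps $Z$ inside $R(y)$, we can run one full scan over $i$ on simulated data and then check the order constraints directly. This is a sketch under assumed values: the predictors, the count outcome and the current value of beta are all hypothetical, and the calls to suppressWarnings are our addition to silence the warning R gives when taking the max or min of an empty set.

```r
set.seed(1)
n <- 100
X <- cbind(rnorm(n), rnorm(n))        # hypothetical predictors
y <- rpois(n, exp(0.5 * X[, 1]))      # any ordinal outcome; here, counts
beta <- c(0.3, 0)                     # hypothetical current value of beta

# starting value consistent with R(y): normal scores of the ranks of y
z <- qnorm(rank(y, ties.method = "random") / (n + 1))

for (i in 1:n) {
  ez <- sum(beta * X[i, ])
  a <- suppressWarnings(max(z[y < y[i]]))   # -Inf if no case has smaller y
  b <- suppressWarnings(min(z[y[i] < y]))   #  Inf if no case has larger y
  u <- runif(1, pnorm(a - ez), pnorm(b - ez))
  z[i] <- ez + qnorm(u)
}

# check: y_i < y_j must imply z_i < z_j for every pair (i, j)
all(outer(y, y, "<") <= outer(z, z, "<"))
```

In a complete sampler this scan would alternate with the update of $\beta$ from its multivariate normal full conditional distribution.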


Not surprisingly, for the educational attainment data the posterior distribution
of $\beta$ based on the rank likelihood is very similar to the one based on
the full ordered probit model. The three panels of Figure 12.3 indicate that
the marginal posterior densities of $\beta_1$, $\beta_2$ and $\beta_3$ are nearly identical under
the two models. In general, if $K$ is small and $n$ is large, we expect the two
methods to behave similarly. However, the rank likelihood approach is applicable
to a wider array of datasets since with this approach, $Y$ is allowed to be
any type of ordinal variable, discrete or continuous. The drawback to using
the rank likelihood is that it does not provide us with inference about $g(z)$,
which describes the relationship between the latent and observed variables. If
this parameter is of interest, then the rank likelihood is not appropriate, but
if interest lies only in $\beta$, then the rank likelihood provides a simple alternative
to the ordered probit model.
Fig. 12.3. Marginal posterior distributions of $(\beta_1, \beta_2, \beta_3)$, under the ordinal probit
regression model (in gray) and the rank likelihood (in black).
12.2 The Gaussian copula model


The regression model above is somewhat limiting because it only describes 
the conditional distribution of one variable given the others. In general, we 
may be interested in the relationships among all of the variables in a dataset. 
If the variables were approximately jointly normally distributed, or at least 
were all measured on a meaningful numerical scale, then we could describe 
the relationships among the variables with the sample covariance matrix or a 
multivariate normal model. However, such a model is inappropriate for non- 
numeric ordinal variables like INC, DEG and PDEG. To accommodate vari- 
ables such as these we can extend the ordered probit model above to a latent, 
multivariate normal model that is appropriate for all types of ordinal data, 
both numeric and non-numeric. Letting $Y_1, \ldots, Y_n$ be i.i.d. random samples
from a $p$-variate population, our latent normal model is

$$Z_1, \ldots, Z_n \sim \text{i.i.d. multivariate normal}(0, \Psi) \qquad (12.4)$$

$$Y_{i,j} = g_j(Z_{i,j}), \qquad (12.5)$$


where $g_1, \ldots, g_p$ are non-decreasing functions and $\Psi$ is a correlation matrix,
having diagonal entries equal to 1. In this model, the matrix $\Psi$ represents the
joint dependencies among the variables and the functions $g_1, \ldots, g_p$ represent
their marginal distributions. To see how the $g_j$’s represent the margins, let’s
calculate the marginal cdf $F_j(y)$ of a continuous random variable $Y_{i,j}$ under
the model given by Equations 12.4 and 12.5. Recalling the definition of the
cdf, we have

$$
\begin{aligned}
F_j(y) &= \Pr(Y_{i,j} \le y) \\
&= \Pr(g_j(Z_{i,j}) \le y), \quad \text{since } Y_{i,j} = g_j(Z_{i,j}) \\
&= \Pr(Z_{i,j} \le g_j^{-1}(y)) \\
&= \Phi(g_j^{-1}(y)),
\end{aligned}
$$

where $\Phi(z)$ is the cdf of the standard normal distribution. The last line holds
because the diagonal entries of $\Psi$ are all equal to 1, and so the marginal
distribution of each $Z_{i,j}$ is a standard normal distribution with cdf $\Phi(z)$.
The above calculations show that $F_j(y) = \Phi(g_j^{-1}(y))$, indicating that the
marginal distributions of the $Y_j$’s are fully determined by the $g_j$’s and do
not depend on the matrix $\Psi$. A model having separate parameters for the
univariate marginal distributions and the multivariate dependencies is generally
called a copula model. The model given by Equations 12.4 and 12.5,
where the dependence is described by a multivariate normal distribution, is
called the multivariate normal copula model. The term “copula” refers to the
method of “coupling” a model for multivariate dependence (such as the multivariate
normal distribution) to a model for the marginal distributions of
the data. As shown above, a copula model separates the parameters for the
dependencies among the variables ($\Psi$) from the parameters describing their
univariate marginal distributions ($g_1, \ldots, g_p$). This separation comes in handy
if we are primarily interested in the dependencies among the variables and not
the univariate scales on which they were measured. In this case, the functions
$g_1, \ldots, g_p$ are nuisance parameters and the parameter of interest is $\Psi$. Using
an extension of the rank likelihood described in the previous section, we will
be able to obtain a posterior distribution for $\Psi$ without having to estimate or
specify prior distributions for the nuisance parameters $g_1, \ldots, g_p$.
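As a quick numerical check of the identity $F_j(y) = \Phi(g_j^{-1}(y))$, consider the particular choice $g_j(z) = \exp(z)$, for which $Y_{i,j}$ is lognormal. This choice of $g_j$ and the evaluation point are our own illustrative assumptions and are not part of the analysis in this chapter.

```r
g     <- exp     # a hypothetical increasing transformation g_j
g.inv <- log     # its inverse

y0 <- 2
F.copula    <- pnorm(g.inv(y0))   # Phi(g^{-1}(y0))
F.lognormal <- plnorm(y0)         # cdf of exp(Z) for Z ~ normal(0, 1)
c(F.copula, F.lognormal)

# Monte Carlo version: the marginal cdf of Y = g(Z)
set.seed(1)
F.mc <- mean(g(rnorm(1e5)) <= y0)
```

The two exact values agree, and the Monte Carlo estimate is close to both, regardless of how $Z_{i,j}$ is correlated with the other latent variables.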


12.2.1 Rank likelihood for copula estimation 


The unknown parameters in the copula model are the matrix $\Psi$ and the non-decreasing
functions $g_1, \ldots, g_p$. Bayesian inference for all of these parameters
would require that we specify a prior for $\Psi$ as well as $p$ prior distributions
over the complicated space of arbitrary non-decreasing functions. If we are
not interested in $g_1, \ldots, g_p$ then we can use a version of the rank likelihood
which quantifies information about $Z_1, \ldots, Z_n$ without having to specify these
nuisance parameters. Recall that since each $g_j$ is non-decreasing, observing the
$n \times p$ data matrix $Y$ tells us that the matrix of latent variables $Z$ must lie in
the set

$$R(Y) = \{Z : z_{i_1,j} < z_{i_2,j} \text{ if } y_{i_1,j} < y_{i_2,j}\}. \qquad (12.6)$$

The probability of this event, $\Pr(Z \in R(Y)|\Psi)$, does not depend on $g_1, \ldots, g_p$.
As a function of $\Psi$, $\Pr(Z \in R(Y)|\Psi)$ is called the rank likelihood for the
multivariate normal copula model. Computing the likelihood for a given value
of $\Psi$ is very difficult, but as with the regression model in Section 12.1.2 we can
make an MCMC approximation to $p(\Psi, Z|Z \in R(Y))$ using Gibbs sampling,
provided we use a prior for $\Psi$ based on the inverse-Wishart distribution.


A parameter-expanded prior distribution for $\Psi$


Unfortunately there is no simple conjugate class of prior distributions for our
correlation matrix $\Psi$. As an alternative, let’s consider altering Equation 12.4
to be

$$Z_1, \ldots, Z_n \sim \text{i.i.d. multivariate normal}(0, \Sigma),$$

where $\Sigma$ is an arbitrary covariance matrix, not restricted to be a correlation
matrix like $\Psi$. In this case a natural prior distribution for $\Sigma$ would be an
inverse-Wishart distribution, which would give an inverse-Wishart full conditional
distribution and thus make posterior inference available via Gibbs
sampling. However, careful inspection of the rank likelihood indicates that
it does not provide us with a complete estimate of $\Sigma$. Specifically, the rank
likelihood contains only information about the relative ordering among the
$Z_{i,j}$’s, and no information about their scale. For example, if $Z_{1,j}$ and $Z_{2,j}$
are two i.i.d. samples from a normal$(0, \sigma_j^2)$ distribution, then the probability
that $Z_{1,j} < Z_{2,j}$ does not depend on $\sigma_j^2$. For this reason we say that the
diagonal entries of $\Sigma$ are non-identifiable in this model, meaning that the rank
likelihood provides no information about what the diagonal should be. In a
Bayesian analysis, the posterior distribution of any non-identifiable parameter
is determined by the prior distribution, and so in some sense the posterior
distribution of such a parameter is not of interest. However, to each covariance
matrix $\Sigma$ there corresponds a unique correlation matrix $\Psi$, obtained by the
function

$$\Psi = h(\Sigma) = \left\{ \sigma_{i,j} / \sqrt{\sigma_i^2 \sigma_j^2} \right\}.$$


The value of $\Psi$ is identifiable from the rank likelihood, and so one estimation
approach for the Gaussian copula model is to reparameterize the model
in terms of a non-identifiable covariance matrix $\Sigma$, but focus our posterior
inference on the identifiable correlation matrix $\Psi = h(\Sigma)$. This technique of
modeling in terms of a non-identifiable parameter in order to simplify calculations
is referred to as parameter expansion (Liu and Wu, 1999), and has been
used in the context of modeling multivariate ordinal data by Hoff (2007) and
Lawrence et al. (2008).
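The function $h$ and its invariance to rescaling are easy to examine numerically. In the sketch below the covariance matrix and the scaling matrix D are hypothetical values chosen for illustration; cov2cor is R’s built-in function for the same transformation.

```r
# h maps a covariance matrix to its correlation matrix
h <- function(Sigma) Sigma / sqrt(outer(diag(Sigma), diag(Sigma)))

# a hypothetical covariance matrix
Sigma <- matrix(c(4.0, 1.2, 0.5,
                  1.2, 1.0, 0.3,
                  0.5, 0.3, 2.0), 3, 3)
Psi <- h(Sigma)

# rescaling each latent variable changes Sigma but not Psi
D <- diag(c(10, 0.1, 3))
Psi.rescaled <- h(D %*% Sigma %*% D)

Psi
```

Because rescaling the columns of $Z$ by positive constants leaves both their ordering and $h(\Sigma)$ unchanged, the rank likelihood can inform $\Psi$ even though it carries no information about the diagonal of $\Sigma$.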
To summarize, we will base our posterior distribution on

$$\Sigma \sim \text{inverse-Wishart}(\nu_0, S_0^{-1}) \qquad (12.7)$$

$$Z_1, \ldots, Z_n \sim \text{i.i.d. multivariate normal}(0, \Sigma)$$

$$Y_{i,j} = g_j(Z_{i,j}),$$

but our estimation and inference will be restricted to $\Psi = h(\Sigma)$. Interestingly,
the posterior distribution for $\Psi$ obtained from this prior and model is exactly
the same as that which would be obtained from the following:

$$\Sigma \sim \text{inverse-Wishart}(\nu_0, S_0^{-1}) \qquad (12.8)$$

$$\Psi = h(\Sigma)$$

$$Z_1, \ldots, Z_n \sim \text{i.i.d. multivariate normal}(0, \Psi)$$

$$Y_{i,j} = g_j(Z_{i,j}).$$

In other words, the non-identifiable model described in Equation 12.7 gives the
same posterior distribution for $\Psi$ as the identifiable model in Equation 12.8 in
which the prior distribution for $\Psi$ is defined by $\{\Sigma \sim$ inverse-Wishart$(\nu_0, S_0^{-1})$,
$\Psi = h(\Sigma)\}$. The only difference is that the Gibbs sampling scheme for Equation
12.7 is easier to formulate. The equivalence of these two models relies on
the scale invariance of the rank likelihood, and so will not generally hold for
other types of models involving correlation matrices.


Full conditional distribution of $\Sigma$


If the prior distribution for $\Sigma$ is inverse-Wishart$(\nu_0, S_0^{-1})$, then, as described
in Section 7.3, the full conditional distribution of $\Sigma$ is inverse-Wishart as well.
We review this fact here by first noting that the probability density of the $n \times p$
matrix $Z$ can be written as

$$
\begin{aligned}
p(Z|\Sigma) &= \prod_{i=1}^{n} (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\{-\tfrac{1}{2} z_i^T \Sigma^{-1} z_i\} \\
&= (2\pi)^{-np/2} |\Sigma|^{-n/2} \exp\{-\text{tr}(Z^T Z \Sigma^{-1})/2\},
\end{aligned}
$$

where “tr(A)” stands for the trace of matrix $A$, which is the sum of the
diagonal elements of $A$. The full conditional distribution $p(\Sigma|Z, Z \in R(Y)) =
p(\Sigma|Z)$ is then given by

$$
\begin{aligned}
p(\Sigma|Z) &\propto p(\Sigma) \times p(Z|\Sigma) \\
&\propto |\Sigma|^{-(\nu_0 + p + 1)/2} \exp\{-\text{tr}(S_0 \Sigma^{-1})/2\} \times |\Sigma|^{-n/2} \exp\{-\text{tr}(Z^T Z \Sigma^{-1})/2\} \\
&= |\Sigma|^{-([\nu_0 + n] + p + 1)/2} \exp\{-\text{tr}([S_0 + Z^T Z] \Sigma^{-1})/2\},
\end{aligned}
$$

which is proportional to an inverse-Wishart$(\nu_0 + n, [S_0 + Z^T Z]^{-1})$ density.
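A draw from this full conditional distribution can be obtained by sampling the precision matrix from a Wishart distribution and inverting it. The sketch below uses rWishart from R’s stats package; the latent data matrix Z and the prior parameters nu0 and S0 are hypothetical stand-ins.

```r
set.seed(1)
n <- 100; p <- 3
Z <- matrix(rnorm(n * p), n, p)    # hypothetical current latent data
nu0 <- p + 2                       # hypothetical prior parameters
S0 <- diag(p)

# Sigma ~ inverse-Wishart(nu0 + n, [S0 + Z'Z]^{-1}) means that
# Sigma^{-1} ~ Wishart(nu0 + n, [S0 + Z'Z]^{-1})
Sn.inv <- solve(S0 + t(Z) %*% Z)
Sigma <- solve(rWishart(1, nu0 + n, Sn.inv)[, , 1])

# the identifiable quantity is the corresponding correlation matrix
Psi <- cov2cor(Sigma)
Psi
```

In the Gibbs sampler, only the values of Psi need to be saved, since the diagonal of Sigma is not identifiable from the rank likelihood.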


Full conditional distribution of Z 


Recall from Section 5 of Chapter 7 that if $Z$ is a random multivariate
normal$(0, \Sigma)$ vector, then the conditional distribution of $Z_j$, given the other
elements $Z_{-j} = z_{-j}$, is a univariate normal distribution with mean and variance
given by

$$\text{E}[Z_j|\Sigma, z_{-j}] = \Sigma_{j,-j}(\Sigma_{-j,-j})^{-1} z_{-j}$$

$$\text{Var}[Z_j|\Sigma, z_{-j}] = \Sigma_{j,j} - \Sigma_{j,-j}(\Sigma_{-j,-j})^{-1}\Sigma_{-j,j},$$

where $\Sigma_{j,-j}$ refers to the $j$th row of $\Sigma$ with the $j$th column removed, and
$\Sigma_{-j,-j}$ refers to $\Sigma$ with both the $j$th row and column removed. If in addition
we condition on the information that $Z \in R(Y)$, then we know that

$$\max\{z_{k,j} : y_{k,j} < y_{i,j}\} < Z_{i,j} < \min\{z_{k,j} : y_{i,j} < y_{k,j}\}.$$


These two pieces of information imply that the full conditional distribution
of $Z_{i,j}$ is a constrained normal distribution, which can be sampled from using
the procedure described in the previous section and in the following R-code:


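# regression coefficients, conditional sd and conditional mean of Z[i,j]
Sjc<- Sigma[j,-j]%*%solve(Sigma[-j,-j])
sz<- sqrt( Sigma[j,j] - Sjc%*%Sigma[-j,j] )
ez<- Z[i,-j]%*%t(Sjc)

# bounds implied by the ordering of the observed data in column j
a<-max(Z[ Y[i,j]>Y[,j] , j ], na.rm=TRUE)
b<-min(Z[ Y[i,j]<Y[,j] , j ], na.rm=TRUE)

u<-runif(1, pnorm( (a-ez)/sz ), pnorm( (b-ez)/sz ) )
Z[i,j]<- ez + sz*qnorm(u)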
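The update above is applied to every entry of $Z$ in each scan. The following sketch runs one such scan on simulated count data and then verifies that the order constraints still hold in every column. The data, the current value of Sigma and the starting values are hypothetical, and suppressWarnings is our addition to silence the warning R gives for the max or min of an empty set.

```r
set.seed(1)
n <- 50; p <- 3
Y <- matrix(rpois(n * p, 2), n, p)          # hypothetical ordinal data
Sigma <- matrix(0.3, p, p); diag(Sigma) <- 1  # hypothetical current Sigma

# starting Z: normal scores of the ranks within each column
Z <- apply(Y, 2, function(y) qnorm(rank(y, ties.method = "random") / (n + 1)))

for (j in 1:p) {
  # these quantities depend only on j, so compute them once per column
  Sjc <- Sigma[j, -j] %*% solve(Sigma[-j, -j])
  sz <- c(sqrt(Sigma[j, j] - Sjc %*% Sigma[-j, j]))
  for (i in 1:n) {
    ez <- c(Z[i, -j] %*% t(Sjc))
    a <- suppressWarnings(max(Z[Y[i, j] > Y[, j], j], na.rm = TRUE))
    b <- suppressWarnings(min(Z[Y[i, j] < Y[, j], j], na.rm = TRUE))
    u <- runif(1, pnorm((a - ez) / sz), pnorm((b - ez) / sz))
    Z[i, j] <- ez + sz * qnorm(u)
  }
}

# check the rank constraints column by column
ok <- sapply(1:p, function(j)
  all(outer(Y[, j], Y[, j], "<") <= outer(Z[, j], Z[, j], "<")))
ok
```

A full Gibbs sampler would alternate this scan with a draw of Sigma from its inverse-Wishart full conditional distribution.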
Missing data


The expression na.rm=TRUE in the above code allows for the possibility
of missing data. Instead of throwing out the rows of the data matrix that
contain some missing values, we would like to use all of the data we can. If
the values are missing-at-random as described in Section 7.5, then they can
simply be treated as unknown parameters and their values imputed using the
Gibbs sampler. For the Gaussian copula model, this imputation happens at
the level of the latent variables. For example, suppose that variable $j$ for case
$i$ is not recorded, i.e. $y_{i,j}$ is not available. As described above, the full conditional
distribution of $Z_{i,j}$ given $Z_{i,-j}$ is normal. If $y_{i,j}$ were observed then the
conditional distribution of $Z_{i,j}$ would be a constrained normal, as observing
$y_{i,j}$ imposes a constraint on the allowable values of $Z_{i,j}$. But if $y_{i,j}$ is missing
then no such constraint is imposed, and the full conditional distribution
of $Z_{i,j}$ is simply the original unconstrained normal distribution. The R-code
above handles this as follows: If $Y_{i,j}$ is missing, then Z[ Y[i,j]>Y[,j] , j ]
is a vector of missing values. The option na.rm=TRUE removes all of these
missing values, so a is the maximum of an empty set, which is defined to be
$-\infty$. Similarly, b will be set to $\infty$.
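This behavior of max and min can be confirmed with a few lines of R. The values below are hypothetical, and suppressWarnings is used only to hide the warning that R issues when max or min is applied to an empty set.

```r
z <- c(-0.3, 1.2, 0.4)    # hypothetical latent values in column j
y <- c(2, 5, 3)           # observed values of the other cases
yi <- NA                  # the value y[i,j] is missing

z[yi > y]                 # a vector of NAs, one per case

a <- suppressWarnings(max(z[yi > y], na.rm = TRUE))
b <- suppressWarnings(min(z[yi < y], na.rm = TRUE))
c(a, b)

# with a = -Inf and b = Inf, u is uniform on (0, 1) and the draw
# of Z[i,j] is from the unconstrained normal distribution
c(pnorm(a), pnorm(b))
```

The same mechanism also drops the other cases whose values of variable $j$ are missing when computing the bounds for a case whose value is observed.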


Example: Social mobility data 


The results of the probit regression of DEG on the variables PDEG and 
CHILD in the last section indicate that the educational level of an individual 
is related to that of their parents. In this section we analyze this further, by 
examining the joint relationships among respondent-specific variables DEG, 
CHILD, INC along with analogous parent-specific variables PDEG, PCHILD 
and PINC. In this data analysis PDEG is a five-level categorical variable with 
the same levels as DEG, recording the highest degree of the respondent’s 
mother or father. PCHILD is the number of siblings of the respondent, and 
so is roughly the number of children of the respondent’s parents. The variable 
PINC is a five-level ordered categorical variable recording the respondent’s 
parent’s financial status when the respondent was 16 years of age. Finally, 
we also include AGE, the respondent’s age in years. Although not of primary 
interest, heterogeneity in a person’s income, number of children and degree 
category is likely to be related to age. 

Using an inverse-Wishart$(p + 2, (p + 2) \times I)$ prior distribution for $\Sigma$ having
a prior mean of E$[\Sigma] = I$, we can implement a Gibbs sampling algorithm using
the full conditional distributions described above. Iterating the algorithm for
25,000 scans, saving parameter values every 25th scan, gives a total of 1,000
values of each parameter with which to approximate the posterior distribution
of $\Psi = h(\Sigma)$. The Monte Carlo estimate of the posterior mean of $\Psi$ is




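$$
\mathrm{E}[\Psi \mid y_1,\ldots,y_n] =
\begin{pmatrix}
 1.00 &  0.48 &  0.29 &  0.13 &  0.17 & -0.05 &  0.34 \\
 0.48 &  1.00 & -0.04 &  0.20 &  0.46 & -0.21 &  0.05 \\
 0.29 & -0.04 &  1.00 & -0.15 & -0.25 &  0.22 &  0.59 \\
 0.13 &  0.20 & -0.15 &  1.00 &  0.44 & -0.22 & -0.13 \\
 0.17 &  0.46 & -0.25 &  0.44 &  1.00 & -0.29 & -0.23 \\
-0.05 & -0.21 &  0.22 & -0.22 & -0.29 &  1.00 &  0.12 \\
 0.34 &  0.05 &  0.59 & -0.13 & -0.23 &  0.12 &  1.00
\end{pmatrix}
$$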


where the columns and rows are, in order, INC, DEG, CHILD, PINC, PDEG,
PCHILD and AGE. We also may be interested in the “regression coefficients”
$\beta_{j|-j} = \Psi_{[j,-j]} (\Psi_{[-j,-j]})^{-1}$, which for each variable $j$ is a vector of length $p - 1$
that describes how the conditional mean of $Z_j$ depends on the remaining
variables $Z_{-j}$. Figure 12.4 summarizes the posterior distributions of each $\beta_{j|-j}$
(except for that of AGE) as follows: A 95% quantile-based confidence interval
is obtained for each $\beta_{j|k}$. If the confidence interval does not contain zero, a
line between variables $j$ and $k$ is drawn, with a “+” or a “−” indicating the
sign of the posterior median. If the interval does contain zero, no line is drawn
between the variables.
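To make the regression coefficients concrete, here is a minimal sketch of how they could be computed from a single correlation matrix. The function name beta_given_rest and the 3 × 3 matrix Psi are assumptions made for illustration only; in an analysis like the one described above, the same calculation would presumably be repeated for each saved posterior sample, with quantile() applied to the resulting values to obtain the 95% intervals.

```r
# Coefficients of the conditional mean of Z_j given the remaining variables,
# computed from a correlation (or covariance) matrix Psi.
beta_given_rest <- function(Psi, j) {
  drop(Psi[j, -j, drop = FALSE] %*% solve(Psi[-j, -j, drop = FALSE]))
}

# Hypothetical 3 x 3 correlation matrix, used only as an example.
Psi <- matrix(c(1.0, 0.5, 0.0,
                0.5, 1.0, 0.0,
                0.0, 0.0, 1.0), nrow = 3, byrow = TRUE)

# Coefficients of Z_1 on (Z_2, Z_3)
beta1 <- beta_given_rest(Psi, 1)
```

In this toy matrix the third variable is uncorrelated with the others, so its coefficient is zero even though the function conditions on it.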

Such a graph is sometimes referred to as a dependence graph, which sum- 
marizes the conditional dependencies among the variables. Roughly speaking, 
two variables in the graph are conditionally independent given the other vari- 
ables if there is no line between them. More precisely, the absence of a line 
indicates the lack of strong evidence of a conditional dependence. For exam- 
ple, although there is a positive marginal dependence between INC and PINC, 
the graph indicates that there is little evidence of any conditional dependence, 
given the other variables. 




Fig. 12.4. Reduced conditional dependence graph for the GSS data. 
12.3 Discussion and further references


Normally distributed latent variables are often used to induce dependence 
among a set of non-normal observed variables. For example, Chib and Winkel- 
mann (2001) present a model for a vector of correlated count data in which 
each component is a Poisson random variable with a mean depending on a 
component-specific latent variable. Dependence among the count variables is 
induced by modeling the vector of latent variables with a multivariate normal 
distribution. Similar approaches are proposed by Dunson (2000) and described 
in Chapter 8 of Congdon (2003). Pitt et al. (2006) discuss Bayesian inference
for Gaussian copula models when the margins are known parametric families, 
and Quinn (2004) presents a factor analysis model for mixed continuous and 
discrete outcomes, in which the continuous variables are treated parametri- 
cally. 

Pettitt (1982) develops the rank likelihood to estimate parameters in a 
latent normal regression model, allowing the transformation from the latent 
data to continuous observed data to be treated nonparametrically. Hoff (2007) 
extends this type of likelihood to accommodate both continuous and discrete 
ordinal data, and provides a Gibbs sampler for parameter estimation in a 
semiparametric Gaussian copula model. 

The rank likelihood is based on the marginal distribution of the ranks, 
and so is called a marginal likelihood. Marginal likelihoods are typically con- 
structed so that they use the information in the data that depends only on 
the parameters of interest, and do not use any information that depends on 
nuisance parameters. Marginal likelihoods do not generally provide efficient 
estimation, as they throw away part of the information in the data. How- 
ever, they can turn a very difficult semiparametric estimation problem into 
essentially a parametric one. The use of marginal likelihoods in the context 
of Bayesian estimation is discussed in Monahan and Boos (1992). 


Exercises 


Chapter 2 


2.1 Marginal and conditional probability: The social mobility data from
Section 2.5 gives a joint probability distribution on $(Y_1, Y_2)$ = (father’s
occupation, son’s occupation). Using this joint distribution, calculate the
following distributions:
a) the marginal probability distribution of a father’s occupation;
b) the marginal probability distribution of a son’s occupation;
c) the conditional distribution of a son’s occupation, given that the father
is a farmer;
d) the conditional distribution of a father’s occupation, given that the
son is a farmer.

2.2 Expectations and variances: Let $Y_1$ and $Y_2$ be two independent random
variables, such that $\mathrm{E}[Y_i] = \mu_i$ and $\mathrm{Var}[Y_i] = \sigma_i^2$. Using the definition of
expectation and variance, compute the following quantities, where $a_1$ and
$a_2$ are given constants:
a) $\mathrm{E}[a_1 Y_1 + a_2 Y_2]$, $\mathrm{Var}[a_1 Y_1 + a_2 Y_2]$;
b) $\mathrm{E}[a_1 Y_1 - a_2 Y_2]$, $\mathrm{Var}[a_1 Y_1 - a_2 Y_2]$.

2.3 Full conditionals: Let $X, Y, Z$ be random variables with joint density
(discrete or continuous) $p(x, y, z) \propto f(x, z) g(y, z) h(z)$. Show that
a) $p(x|y, z) \propto f(x, z)$, i.e. $p(x|y, z)$ is a function of $x$ and $z$;
b) $p(y|x, z) \propto g(y, z)$, i.e. $p(y|x, z)$ is a function of $y$ and $z$;
c) $X$ and $Y$ are conditionally independent given $Z$.

2.4 Symbolic manipulation: Prove the following form of Bayes’ rule:

$$
\Pr(H_j \mid E) = \frac{\Pr(E \mid H_j)\Pr(H_j)}{\sum_{k=1}^{K} \Pr(E \mid H_k)\Pr(H_k)}
$$

where $E$ is any event and $\{H_1,\ldots,H_K\}$ form a partition. Prove this using
only axioms P1-P3 from this chapter, by following steps a)-d) below:
a) Show that $\Pr(H_j|E)\Pr(E) = \Pr(E|H_j)\Pr(H_j)$.


P.D. Hoff, A First Course in Bayesian Statistical Methods, 
Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_BM2, 
© Springer Science+Business Media, LLC 2009 
b) Show that $\Pr(E) = \Pr(E \cap H_1) + \Pr(E \cap \{\cup_{k=2}^{K} H_k\})$.
c) Show that $\Pr(E) = \sum_{k=1}^{K} \Pr(E \cap H_k)$.
d) Put it all together to show Bayes’ rule, as described above.

2.5 Urns: Suppose urn H is filled with 40% green balls and 60% red balls, and
urn T is filled with 60% green balls and 40% red balls. Someone will flip
a coin and then select a ball from urn H or urn T depending on whether
the coin lands heads or tails, respectively. Let $X$ be 1 or 0 if the coin lands
heads or tails, and let $Y$ be 1 or 0 if the ball is green or red.
a) Write out the joint distribution of $X$ and $Y$ in a table.
b) Find $\mathrm{E}[Y]$. What is the probability that the ball is green?
c) Find $\mathrm{Var}[Y|X = 0]$, $\mathrm{Var}[Y|X = 1]$ and $\mathrm{Var}[Y]$. Thinking of variance as
measuring uncertainty, explain intuitively why one of these variances
is larger than the others.
d) Suppose you see that the ball is green. What is the probability that
the coin turned up tails?

2.6 Conditional independence: Suppose events $A$ and $B$ are conditionally
independent given $C$, which is written $A \perp B \mid C$. Show that this implies that
$A^c \perp B \mid C$, $A \perp B^c \mid C$, and $A^c \perp B^c \mid C$, where $A^c$ means “not $A$.” Find an
example where $A \perp B \mid C$ holds but $A \perp B \mid C^c$ does not hold.

2.7 Coherence of bets: de Finetti thought of subjective probability as follows:
Your probability $p(E)$ for event $E$ is the amount you would be willing to
pay or charge in exchange for a dollar on the occurrence of $E$. In other
words, you must be willing to
• give $p(E)$ to someone, provided they give you \$1 if $E$ occurs;
• take $p(E)$ from someone, and give them \$1 if $E$ occurs.
Your probability for the event $E^c$ = “not $E$” is defined similarly.
a) Show that it is a good idea to have $p(E) \le 1$.
b) Show that it is a good idea to have $p(E) + p(E^c) = 1$.

2.8 Interpretations of probability: One abstract way to define probability is
via measure theory, in that $\Pr(\cdot)$ is simply a “measure” that assigns mass
to various events. For example, we can “measure” the number of times a
particular event occurs in a potentially infinite sequence, or we can “measure”
our information about the outcome of an unknown event. The above
two types of measures are combined in de Finetti’s theorem, which tells
us that an exchangeable model for an infinite binary sequence $Y_1, Y_2, \ldots$
is equivalent to modeling the sequence as conditionally i.i.d. given a
parameter $\theta$, where $\Pr(\theta < c)$ represents our information that the long-run
frequency of 1’s is less than $c$. With this in mind, discuss the different
ways in which probability could be interpreted in each of the following
scenarios. Avoid using the word “probable” or “likely” when describing
probability. Also discuss the different ways in which the events can be
thought of as random.
a) The distribution of religions in Sri Lanka is 70% Buddhist, 15% Hindu,
8% Christian, and 7% Muslim. Suppose each person can be identified
by a number from 1 to $K$ on a census roll. A number $x$ is to be
sampled from $\{1,\ldots,K\}$ using a pseudo-random number generator
on a computer. Interpret the meaning of the following probabilities:
i. Pr(person $x$ is Hindu);
ii. Pr($x$ = 6452859);
iii. Pr(Person $x$ is Hindu | $x$ = 6452859).
b) A quarter which you got as change is to be flipped many times. Interpret
the meaning of the following probabilities:
i. Pr($\theta$, the long-run relative frequency of heads, equals 1/3);
ii. Pr(the first coin flip will result in a heads);
iii. Pr(the first coin flip will result in a heads | $\theta$ = 1/3).
c) The quarter above has been flipped, but you have not seen the outcome.
Interpret Pr(the flip has resulted in a heads).

Chapter 3 


3.1 Sample survey: Suppose we are going to sample 100 individuals from
a county (of size much larger than 100) and ask each sampled person
whether they support policy Z or not. Let $Y_i = 1$ if person $i$ in the sample
supports the policy, and $Y_i = 0$ otherwise.
a) Assume $Y_1,\ldots,Y_{100}$ are, conditional on $\theta$, i.i.d. binary random variables
with expectation $\theta$. Write down the joint distribution of
$\Pr(Y_1 = y_1,\ldots,Y_{100} = y_{100} \mid \theta)$ in a compact form. Also write down the form of
$\Pr(\sum Y_i = y \mid \theta)$.
b) For the moment, suppose you believed that $\theta \in \{0.0, 0.1, \ldots, 0.9, 1.0\}$.
Given that the results of the survey were $\sum_{i=1}^{100} Y_i = 57$, compute
$\Pr(\sum Y_i = 57 \mid \theta)$ for each of these 11 values of $\theta$ and plot these
probabilities as a function of $\theta$.
c) Now suppose you originally had no prior information to believe one of
these $\theta$-values over another, and so $\Pr(\theta = 0.0) = \Pr(\theta = 0.1) = \cdots =
\Pr(\theta = 0.9) = \Pr(\theta = 1.0)$. Use Bayes’ rule to compute $p(\theta \mid \sum_{i=1}^{n} Y_i = 57)$
for each $\theta$-value. Make a plot of this posterior distribution as a
function of $\theta$.
d) Now suppose you allow $\theta$ to be any value in the interval [0,1]. Using
the uniform prior density for $\theta$, so that $p(\theta) = 1$, plot the posterior
density $p(\theta) \times \Pr(\sum_{i=1}^{n} Y_i = 57 \mid \theta)$ as a function of $\theta$.
e) As discussed in this chapter, the posterior distribution of $\theta$ is beta(1 +
57, 1 + 100 − 57). Plot the posterior density as a function of $\theta$. Discuss
the relationships among all of the plots you have made for this exercise.

3.2 Sensitivity analysis: It is sometimes useful to express the parameters $a$
and $b$ in a beta distribution in terms of $\theta_0 = a/(a + b)$ and $n_0 = a + b$,
so that $a = \theta_0 n_0$ and $b = (1 - \theta_0) n_0$. Reconsidering the sample survey
data in Exercise 3.1, for each combination of $\theta_0 \in \{0.1, 0.2, \ldots, 0.9\}$ and
$n_0 \in \{1, 2, 8, 16, 32\}$ find the corresponding $a, b$ values and compute
$\Pr(\theta > 0.5 \mid \sum Y_i = 57)$ using a beta($a, b$) prior distribution for $\theta$. Display the
results with a contour plot, and discuss how the plot could be used to
explain to someone whether or not they should believe that $\theta > 0.5$,
based on the data that $\sum_{i=1}^{100} Y_i = 57$.

3.3 Tumor counts: A cancer laboratory is estimating the rate of tumorigenesis
in two strains of mice, A and B. They have tumor count data for 10 mice
in strain A and 13 mice in strain B. Type A mice have been well studied,
and information from other laboratories suggests that type A mice have
tumor counts that are approximately Poisson-distributed with a mean of
12. Tumor count rates for type B mice are unknown, but type B mice are
related to type A mice. The observed tumor counts for the two populations
are

$y_A = (12, 9, 12, 14, 13, 13, 15, 8, 15, 6)$;
$y_B = (11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)$.

a) Find the posterior distributions, means, variances and 95% quantile-based
confidence intervals for $\theta_A$ and $\theta_B$, assuming a Poisson sampling
distribution for each group and the following prior distribution:

$\theta_A \sim$ gamma(120, 10), $\theta_B \sim$ gamma(12, 1), $p(\theta_A, \theta_B) = p(\theta_A) \times p(\theta_B)$.

b) Compute and plot the posterior expectation of $\theta_B$ under the prior
distribution $\theta_B \sim$ gamma($12 \times n_0$, $n_0$) for each value of $n_0 \in \{1, 2, \ldots, 50\}$.
Describe what sort of prior beliefs about $\theta_B$ would be necessary in order
for the posterior expectation of $\theta_B$ to be close to that of $\theta_A$.
c) Should knowledge about population A tell us anything about population B?
Discuss whether or not it makes sense to have $p(\theta_A, \theta_B) =
p(\theta_A) \times p(\theta_B)$.

3.4 Mixtures of beta priors: Estimate the probability $\theta$ of teen recidivism
based on a study in which there were $n = 43$ individuals released from
incarceration and $y = 15$ re-offenders within 36 months.
a) Using a beta(2,8) prior for $\theta$, plot $p(\theta)$, $p(y|\theta)$ and $p(\theta|y)$ as functions
of $\theta$. Find the posterior mean, mode, and standard deviation of $\theta$.
Find a 95% quantile-based confidence interval.
b) Repeat a), but using a beta(8,2) prior for $\theta$.
c) Consider the following prior distribution for $\theta$:

$$
p(\theta) = \frac{1}{4} \frac{\Gamma(10)}{\Gamma(2)\Gamma(8)} \left[ 3\theta(1-\theta)^7 + \theta^7(1-\theta) \right],
$$

which is a 75-25% mixture of a beta(2,8) and a beta(8,2) prior distribution.
Plot this prior distribution and compare it to the priors in a)
and b). Describe what sort of prior opinion this may represent.
d) For the prior in c):
i. Write out mathematically $p(\theta) \times p(y|\theta)$ and simplify as much as
possible.
ii. The posterior distribution is a mixture of two distributions you
know. Identify these distributions.
iii. On a computer, calculate and plot $p(\theta) \times p(y|\theta)$ for a variety of $\theta$
values. Also find (approximately) the posterior mode, and discuss
its relation to the modes in a) and b).
e) Find a general formula for the weights of the mixture distribution in
d)ii, and provide an interpretation for their values.

3.5 Mixtures of conjugate priors: Let $p(y|\phi) = c(\phi) h(y) \exp\{\phi t(y)\}$ be an
exponential family model and let $p_1(\phi), \ldots, p_K(\phi)$ be $K$ different members
of the conjugate class of prior densities given in Section 3.3. A mixture of
conjugate priors is given by $\tilde{p}(\theta) = \sum_{k=1}^{K} w_k p_k(\theta)$, where the $w_k$’s are all
greater than zero and $\sum w_k = 1$ (see also Diaconis and Ylvisaker (1985)).
a) Identify the general form of the posterior distribution of $\theta$, based on
$n$ i.i.d. samples from $p(y|\theta)$ and the prior distribution given by $\tilde{p}$.
b) Repeat a) but in the special case that $p(y|\theta)$ = dpois($y, \theta$) and
$p_1, \ldots, p_K$ are gamma densities.

3.6 Exponential family expectations: Let $p(y|\phi) = c(\phi) h(y) \exp\{\phi t(y)\}$ be an
exponential family model.
a) Take derivatives with respect to $\phi$ of both sides of the equation
$\int p(y|\phi)\, dy = 1$ to show that $\mathrm{E}[t(Y)|\phi] = -c'(\phi)/c(\phi)$.
b) Let $p(\phi) \propto c(\phi)^{n_0} e^{n_0 t_0 \phi}$ be the prior distribution for $\phi$. Calculate
$dp(\phi)/d\phi$ and, using the fundamental theorem of calculus, discuss
what must be true so that $\mathrm{E}[-c'(\phi)/c(\phi)] = t_0$.

3.7 Posterior prediction: Consider a pilot study in which $n_1 = 15$ children
enrolled in special education classes were randomly selected and tested
for a certain type of learning disability. In the pilot study, $y_1 = 2$ children
tested positive for the disability.
a) Using a uniform prior distribution, find the posterior distribution of
$\theta$, the fraction of students in special education classes who have the
disability. Find the posterior mean, mode and standard deviation of
$\theta$, and plot the posterior density.
Researchers would like to recruit students with the disability to participate
in a long-term study, but first they need to make sure they can recruit
enough students. Let $n_2 = 278$ be the number of children in special
education classes in this particular school district, and let $Y_2$ be the number
of students with the disability.
b) Find $\Pr(Y_2 = y_2 \mid Y_1 = 2)$, the posterior predictive distribution of $Y_2$,
as follows:
i. Discuss what assumptions are needed about the joint distribution
of $(Y_1, Y_2)$ such that the following is true:

$$
\Pr(Y_2 = y_2 \mid Y_1 = 2) = \int_0^1 \Pr(Y_2 = y_2 \mid \theta)\, p(\theta \mid Y_1 = 2)\, d\theta .
$$

ii. Now plug in the forms for $\Pr(Y_2 = y_2 \mid \theta)$ and $p(\theta \mid Y_1 = 2)$ in the
above integral.
iii. Figure out what the above integral must be by using the calculus
result discussed in Section 3.1.
c) Plot the function $\Pr(Y_2 = y_2 \mid Y_1 = 2)$ as a function of $y_2$. Obtain the
mean and standard deviation of $Y_2$, given $Y_1 = 2$.
d) The posterior mode and the MLE (maximum likelihood estimate; see
Exercise 3.14) of $\theta$, based on data from the pilot study, are both
$\hat{\theta} = 2/15$. Plot the distribution $\Pr(Y_2 = y_2 \mid \theta = \hat{\theta})$, and find the mean
and standard deviation of $Y_2$ given $\theta = \hat{\theta}$. Compare these results to
the plots and calculations in c) and discuss any differences. Which
distribution for $Y_2$ would you use to make predictions, and why?

3.8 Coins: Diaconis and Ylvisaker (1985) suggest that coins spun on a flat
surface display long-run frequencies of heads that vary from coin to coin.
About 20% of the coins behave symmetrically, whereas the remaining
coins tend to give frequencies of 1/3 or 2/3.
a) Based on the observations of Diaconis and Ylvisaker, use an appropriate
mixture of beta distributions as a prior distribution for $\theta$, the
long-run frequency of heads for a particular coin. Plot your prior.
b) Choose a single coin and spin it at least 50 times. Record the number
of heads obtained. Report the year and denomination of the coin.
c) Compute your posterior for $\theta$, based on the information obtained in
b).
d) Repeat b) and c) for a different coin, but possibly using a prior for
$\theta$ that includes some information from the first coin. Your choice of
a new prior may be informal, but needs to be justified. How the results
from the first experiment influence your prior for the $\theta$ of the
second coin may depend on whether or not the two coins have the
same denomination, have a similar year, etc. Report the year and
denomination of this coin.

3.9 Galenshore distribution: An unknown quantity $Y$ has a Galenshore($a, \theta$)
distribution if its density is given by

$$
p(y) = \frac{2}{\Gamma(a)} \theta^{2a} y^{2a-1} e^{-\theta^2 y^2}
$$

for $y > 0$, $\theta > 0$ and $a > 0$. Assume for now that $a$ is known. For this
density,

$$
\mathrm{E}[y^2] = \frac{a}{\theta^2}.
$$

a) Identify a class of conjugate prior densities for $\theta$. Plot a few members
of this class of densities.
b) Let $Y_1, \ldots, Y_n \sim$ i.i.d. Galenshore($a, \theta$). Find the posterior distribution
of $\theta$ given $Y_1, \ldots, Y_n$, using a prior from your conjugate class.
c) Write down $p(\theta_a \mid Y_1, \ldots, Y_n)/p(\theta_b \mid Y_1, \ldots, Y_n)$ and simplify. Identify a
sufficient statistic.
d) Determine $\mathrm{E}[\theta \mid y_1, \ldots, y_n]$.


e) Determine the form of the posterior predictive density $p(\tilde{y} \mid y_1, \ldots, y_n)$.

3.10 Change of variables: Let $\psi = g(\theta)$, where $g$ is a monotone function of $\theta$,
and let $h$ be the inverse of $g$ so that $\theta = h(\psi)$. If $p_\theta(\theta)$ is the probability
density of $\theta$, then the probability density of $\psi$ induced by $p_\theta$ is given by
$p_\psi(\psi) = p_\theta(h(\psi)) \times \left| \frac{dh}{d\psi} \right|$.
a) Let $\theta \sim$ beta($a, b$) and let $\psi = \log[\theta/(1 - \theta)]$. Obtain the form of $p_\psi$
and plot it for the case that $a = b = 1$.
b) Let $\theta \sim$ gamma($a, b$) and let $\psi = \log \theta$. Obtain the form of $p_\psi$ and
plot it for the case that $a = b = 1$.

3.12 Jeffreys’ prior: Jeffreys (1961) suggested a default rule for generating a
prior distribution of a parameter $\theta$ in a sampling model $p(y|\theta)$. Jeffreys’
prior is given by $p_J(\theta) \propto \sqrt{I(\theta)}$, where $I(\theta) = -\mathrm{E}[\partial^2 \log p(Y|\theta)/\partial\theta^2 \mid \theta]$
is the Fisher information.
a) Let $Y \sim$ binomial($n, \theta$). Obtain Jeffreys’ prior distribution $p_J(\theta)$ for
this model.
b) Reparameterize the binomial sampling model with $\psi = \log \theta/(1 - \theta)$,
so that $p(y|\psi) = \binom{n}{y} e^{\psi y} (1 + e^{\psi})^{-n}$. Obtain Jeffreys’ prior distribution
$p_J(\psi)$ for this model.
c) Take the prior distribution from a) and apply the change of variables
formula from Exercise 3.10 to obtain the induced prior density on $\psi$.
This density should be the same as the one derived in part b) of this
exercise. This consistency under reparameterization is the defining
characteristic of Jeffreys’ prior.

3.13 Improper Jeffreys’ prior: Let $Y \sim$ Poisson($\theta$).
a) Apply Jeffreys’ procedure to this model, and compare the result to the
family of gamma densities. Does Jeffreys’ procedure produce an actual
probability density for $\theta$? In other words, can $\sqrt{I(\theta)}$ be proportional
to an actual probability density for $\theta \in (0, \infty)$?
b) Obtain the form of the function $f(\theta, y) = \sqrt{I(\theta)} \times p(y|\theta)$. What
probability density for $\theta$ is $f(\theta, y)$ proportional to? Can we think of
$f(\theta, y)/\int f(\theta, y)\, d\theta$ as a posterior density of $\theta$ given $Y = y$?

3.14 Unit information prior: Let $Y_1, \ldots, Y_n \sim$ i.i.d. $p(y|\theta)$. Having observed
the values $Y_1 = y_1, \ldots, Y_n = y_n$, the log likelihood is given by $l(\theta|y) =
\sum \log p(y_i|\theta)$, and the value $\hat{\theta}$ of $\theta$ that maximizes $l(\theta|y)$ is called the
maximum likelihood estimator. The negative of the curvature of the
log-likelihood, $J(\theta) = -\partial^2 l(\theta|y)/\partial\theta^2$, describes the precision of the MLE $\hat{\theta}$
and is called the observed Fisher information. For situations in which it
is difficult to quantify prior information in terms of a probability
distribution, some have suggested that the “prior” distribution be based on
the likelihood, for example, by centering the prior distribution around the
MLE $\hat{\theta}$. To deal with the fact that the MLE is not really prior information,
the curvature of the prior is chosen so that it has only “one $n$th” as much
information as the likelihood, so that $-\partial^2 \log p(\theta)/\partial\theta^2 = J(\theta)/n$. Such a
prior is called a unit information prior (Kass and Wasserman, 1995; Kass
and Raftery, 1995), as it has as much information as the average amount
of information from a single observation. The unit information prior is
not really a prior distribution, as it is computed from the observed data.
However, it can be roughly viewed as the prior information of someone
with weak but accurate prior information.
a) Let $Y_1, \ldots, Y_n \sim$ i.i.d. binary($\theta$). Obtain the MLE $\hat{\theta}$ and $J(\hat{\theta})/n$.
b) Find a probability density $p_U(\theta)$ such that $\log p_U(\theta) = l(\theta|y)/n +
c$, where $c$ is a constant that does not depend on $\theta$. Compute the
information $-\partial^2 \log p_U(\theta)/\partial\theta^2$ of this density.
c) Obtain a probability density for $\theta$ that is proportional to $p_U(\theta) \times
p(y_1, \ldots, y_n|\theta)$. Can this be considered a posterior distribution for $\theta$?
d) Repeat a), b) and c) but with $p(y|\theta)$ being the Poisson distribution.


Chapter 4 


4.1 Posterior comparisons: Reconsider the sample survey in Exercise 3.1.
Suppose you are interested in comparing the rate of support in that county to
the rate in another county. Suppose that a survey of sample size 50 was
done in the second county, and the total number of people in the sample
who supported the policy was 30. Identify the posterior distribution of $\theta_2$
assuming a uniform prior. Sample 5,000 values of each of $\theta_1$ and $\theta_2$ from
their posterior distributions and estimate $\Pr(\theta_1 < \theta_2 \mid$ the data and prior).

4.2 Tumor count comparisons: Reconsider the tumor count data in Exercise
3.3:
a) For the prior distribution given in part a) of that exercise, obtain
$\Pr(\theta_B < \theta_A \mid y_A, y_B)$ via Monte Carlo sampling.
b) For a range of values of $n_0$, obtain $\Pr(\theta_B < \theta_A \mid y_A, y_B)$ for $\theta_A \sim$
gamma(120, 10) and $\theta_B \sim$ gamma($12 \times n_0$, $n_0$). Describe how sensitive
the conclusions about the event $\{\theta_B < \theta_A\}$ are to the prior distribution
on $\theta_B$.
c) Repeat parts a) and b), replacing the event $\{\theta_B < \theta_A\}$ with the event
$\{\tilde{Y}_B < \tilde{Y}_A\}$, where $\tilde{Y}_A$ and $\tilde{Y}_B$ are samples from the posterior
predictive distribution.

4.3 Posterior predictive checks: Let’s investigate the adequacy of the Poisson
model for the tumor count data. Following the example in Section
4.4, generate posterior predictive datasets $y_A^{(1)}, \ldots, y_A^{(1000)}$. Each $y_A^{(s)}$ is a
sample of size $n_A = 10$ from the Poisson distribution with parameter $\theta_A^{(s)}$,
$\theta_A^{(s)}$ is itself a sample from the posterior distribution $p(\theta_A \mid y_A)$, and $y_A$ is
the observed data.
a) For each $s$, let $t^{(s)}$ be the sample average of the 10 values of $y_A^{(s)}$,
divided by the sample standard deviation of $y_A^{(s)}$. Make a histogram
of $t^{(s)}$ and compare to the observed value of this statistic. Based on
this statistic, assess the fit of the Poisson model for these data.


b) Repeat the above goodness of fit evaluation for the data in population
B.

4.4 Mixtures of conjugate priors: For the posterior density from Exercise 3.4:
a) Make a plot of $p(\theta|y)$ or $p(y|\theta)p(\theta)$ using the mixture prior distribution
and a dense sequence of $\theta$-values. Can you think of a way to obtain
a 95% quantile-based posterior confidence interval for $\theta$? You might
want to try some sort of discrete approximation.
b) To sample a random variable $z$ from the mixture distribution $w p_1(z) +
(1 - w) p_0(z)$, first toss a $w$-coin and let $x$ be the outcome (this can be
done in R with x<-rbinom(1,1,w) ). Then if $x = 1$ sample $z$ from $p_1$,
and if $x = 0$ sample $z$ from $p_0$. Using this technique, obtain a Monte
Carlo approximation of the posterior distribution $p(\theta|y)$ and a 95%
quantile-based confidence interval, and compare them to the results
in part a).
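The two-step sampling scheme described in part b) can be sketched generically as follows. This is an illustration only: the function rmix and the two normal components are assumptions chosen for the example, not the posterior from Exercise 3.4, and the draw is vectorized by tossing all n coins at once rather than one at a time.

```r
# Draw n values from the mixture w * p1 + (1 - w) * p0.
# r1 and r0 are functions returning m draws from p1 and p0, respectively.
rmix <- function(n, w, r1, r0) {
  x <- rbinom(n, 1, w)            # toss the w-coin n times
  z <- numeric(n)
  z[x == 1] <- r1(sum(x == 1))    # coin shows 1: sample from p1
  z[x == 0] <- r0(sum(x == 0))    # coin shows 0: sample from p0
  z
}

set.seed(1)
z <- rmix(10000, w = 0.3,
          r1 = function(m) rnorm(m, mean = -2, sd = 1),
          r0 = function(m) rnorm(m, mean = 3, sd = 1))

# Monte Carlo summaries of the mixture distribution
mean(z)
quantile(z, c(0.025, 0.975))
```

Passing the component samplers as functions keeps the sketch independent of any particular pair of distributions; beta or gamma samplers could be substituted in the same way.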

4.5 Cancer deaths: Suppose for a set of counties $i \in \{1, \ldots, n\}$ we have
information on the population size $X_i$ = number of people in 10,000s, and $Y_i$ =
number of cancer fatalities. One model for the distribution of cancer
fatalities is that, given the cancer rate $\theta$, they are independently distributed
with $Y_i \sim$ Poisson($\theta X_i$).
a) Identify the posterior distribution of $\theta$ given data $(Y_1, X_1), \ldots, (Y_n, X_n)$
and a gamma($a, b$) prior distribution.
The file cancer_react.dat contains 1990 population sizes (in 10,000s)
and number of cancer fatalities for 10 counties in a Midwestern state
that are near nuclear reactors. The file cancer_noreact.dat contains the
same data on counties in the same state that are not near nuclear reactors.
Consider these data as samples from two populations of counties: one is
the population of counties with no neighboring reactors and a fatality rate
of $\theta_1$ deaths per 10,000, and the other is a population of counties having
nearby reactors and a fatality rate of $\theta_2$. In this exercise we will model
beliefs about the rates as independent and such that $\theta_1 \sim$ gamma($a_1, b_1$)
and $\theta_2 \sim$ gamma($a_2, b_2$).
b) Using the numerical values of the data, identify the posterior
distributions for $\theta_1$ and $\theta_2$ for any values of $(a_1, b_1, a_2, b_2)$.
c) Suppose cancer rates from previous years have been roughly $\tilde{\theta} = 2.2$
per 10,000 (and note that most counties are not near reactors).
For each of the following three prior opinions, compute $\mathrm{E}[\theta_1|\text{data}]$,
$\mathrm{E}[\theta_2|\text{data}]$, 95% quantile-based posterior intervals for $\theta_1$ and $\theta_2$, and
$\Pr(\theta_2 > \theta_1|\text{data})$. Also plot the posterior densities (try to put $p(\theta_1|\text{data})$
and $p(\theta_2|\text{data})$ on the same plot). Comment on the differences across
posterior opinions.
i. Opinion 1: ($a_1 = a_2 = 2.2 \times 100$, $b_1 = b_2 = 100$). Cancer rates for
both types of counties are similar to the average rates across all
counties from previous years.
ii. Opinion 2: ($a_1 = 2.2 \times 100$, $b_1 = 100$, $a_2 = 2.2$, $b_2 = 1$). Cancer
rates in this year for nonreactor counties are similar to rates in
previous years in nonreactor counties. We don’t have much
information on reactor counties, but perhaps the rates are close to
those observed previously in nonreactor counties.
iii. Opinion 3: ($a_1 = a_2 = 2.2$, $b_1 = b_2 = 1$). Cancer rates in this year
could be different from rates in previous years, for both reactor
and nonreactor counties.
d) In the above analysis we assumed that population size gives no
information about fatality rate. Is this reasonable? How would the analysis
have to change if this is not reasonable?
e) We encoded our beliefs about $\theta_1$ and $\theta_2$ such that they gave no
information about each other (they were a priori independent). Think
about why and how you might encode beliefs such that they were a
priori dependent.

4.6 Non-informative prior distributions: Suppose for a binary sampling
problem we plan on using a uniform, or beta(1,1), prior for the population
proportion $\theta$. Perhaps our reasoning is that this represents “no prior
information about $\theta$.” However, some people like to look at proportions on
the log-odds scale, that is, they are interested in $\gamma = \log \frac{\theta}{1-\theta}$. Via Monte
Carlo sampling or otherwise, find the prior distribution for $\gamma$ that is
induced by the uniform prior for $\theta$. Is the prior informative about $\gamma$?

4.7 Mixture models: After a posterior analysis on data from a population of
squash plants, it was determined that the total vegetable weight of a given
plant could be modeled with the following distribution:

$p(y|\theta, \sigma^2)$ = .31 dnorm($y, \theta, \sigma$) + .46 dnorm($y, 2\theta, 2\sigma$) + .23 dnorm($y, 3\theta, 3\sigma$)

where the posterior distributions of the parameters have been calculated
as $1/\sigma^2 \sim$ gamma(10, 2.5), and $\theta|\sigma^2 \sim$ normal(4.1, $\sigma^2/20$).
a) Sample at least 5,000 $y$ values from the posterior predictive distribution.
b) Form a 75% quantile-based confidence interval for a new value of $Y$.
c) Form a 75% HPD region for a new $Y$ as follows:
i. Compute estimates of the posterior density of $Y$ using the density
command in R, and then normalize the density values so they sum
to 1.
ii. Sort these discrete probabilities in decreasing order.
iii. Find the first probability value such that the cumulative sum of
the sorted values exceeds 0.75. Your HPD region includes all values
of $y$ which have a discretized probability greater than this cutoff.
Describe your HPD region, and compare it to your quantile-based
region.
d) Can you think of a physical justification for the mixture sampling
distribution of $Y$?
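Steps i-iii of part c) can be sketched as a small function. This is one possible reading of those steps, shown on simulated standard normal draws rather than on the squash-plant model; the name hpd_grid and the choice to keep grid points whose probability is at least the cutoff are assumptions made for this example.

```r
# Approximate HPD region from Monte Carlo samples y, following steps i-iii.
hpd_grid <- function(y, prob = 0.75) {
  d <- density(y)                          # kernel density estimate on a grid
  p <- d$y / sum(d$y)                      # i. normalize values to sum to 1
  p_sorted <- sort(p, decreasing = TRUE)   # ii. sort in decreasing order
  cutoff <- p_sorted[which(cumsum(p_sorted) > prob)[1]]  # iii. find the cutoff
  d$x[p >= cutoff]                         # grid points inside the region
}

set.seed(1)
y <- rnorm(5000)                # stand-in for posterior predictive samples
region <- hpd_grid(y)
range(region)                   # endpoints when the region is a single interval
quantile(y, c(0.125, 0.875))    # 75% quantile-based interval for comparison
```

For a symmetric unimodal density the two regions nearly coincide; for a multimodal density the returned grid points may form several disjoint intervals, in which case range() alone would not describe the region.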
4.8 More posterior predictive checks: Let $\theta_A$ and $\theta_B$ be the average number
of children of men in their 30s with and without bachelor’s degrees,
respectively.
a) Using a Poisson sampling model, a gamma(2,1) prior for each $\theta$ and the
data in the files menchild30bach.dat and menchild30nobach.dat,
obtain 5,000 samples of $\tilde{Y}_A$ and $\tilde{Y}_B$ from the posterior predictive
distribution of the two samples. Plot the Monte Carlo approximations to
these two posterior predictive distributions.
b) Find 95% quantile-based posterior confidence intervals for $\theta_B - \theta_A$ and
$\tilde{Y}_B - \tilde{Y}_A$. Describe in words the differences between the two populations
using these quantities and the plots in a), along with any other results
that may be of interest to you.
c) Obtain the empirical distribution of the data in group B. Compare
this to the Poisson distribution with mean $\hat{\theta} = 1.4$. Do you think the
Poisson model is a good fit? Why or why not?
d) For each of the 5,000 $\theta_B$-values you sampled, sample $n_B = 218$ Poisson
random variables and count the number of 0s and the number of 1s
in each of the 5,000 simulated datasets. You should now have two
sequences of length 5,000 each, one sequence counting the number of
people having zero children for each of the 5,000 posterior predictive
datasets, the other counting the number of people with one child.
Plot the two sequences against one another (one on the $x$-axis, one
on the $y$-axis). Add to the plot a point marking how many people in
the observed dataset had zero children and one child. Using this plot,
describe the adequacy of the Poisson model.


Chapter 5 


5.1 Studying: The files school1.dat, school2.dat and school3.dat contain
data on the amount of time students from three high schools spent on
studying or homework during an exam period. Analyze data from each of
these schools separately, using the normal model with a conjugate prior
distribution, in which $\{\mu_0 = 5, \sigma_0^2 = 4, \kappa_0 = 1, \nu_0 = 2\}$ and compute or
approximate the following:
a) posterior means and 95% confidence intervals for the mean $\theta$ and
standard deviation $\sigma$ from each school;
b) the posterior probability that $\theta_i < \theta_j < \theta_k$ for all six permutations
$\{i, j, k\}$ of $\{1, 2, 3\}$;
c) the posterior probability that $\tilde{Y}_i < \tilde{Y}_j < \tilde{Y}_k$ for all six permutations
$\{i, j, k\}$ of $\{1, 2, 3\}$, where $\tilde{Y}_i$ is a sample from the posterior predictive
distribution of school $i$.
d) Compute the posterior probability that $\theta_1$ is bigger than both $\theta_2$ and
$\theta_3$, and the posterior probability that $\tilde{Y}_1$ is bigger than both $\tilde{Y}_2$ and
$\tilde{Y}_3$.

5.2 Sensitivity analysis: Thirty-two students in a science classroom were
randomly assigned to one of two study methods, A and B, so that
$n_A = n_B = 16$ students were assigned to each method. After several
weeks of study, students were examined on the course material with an
exam designed to give an average score of 75 with a standard deviation of
10. The scores for the two groups are summarized by $\{\bar{y}_A = 75.2, s_A = 7.3\}$
and $\{\bar{y}_B = 77.5, s_B = 8.1\}$. Consider independent, conjugate normal prior
distributions for each of $\theta_A$ and $\theta_B$, with $\mu_0 = 75$ and $\sigma_0^2 = 100$ for
both groups. For each $(\kappa_0, \nu_0) \in \{(1,1),(2,2),(4,4),(8,8),(16,16),(32,32)\}$
(or more values), obtain $\Pr(\theta_A < \theta_B \mid y_A, y_B)$ via Monte Carlo sampling.
Plot this probability as a function of $(\kappa_0 = \nu_0)$. Describe how you might
use this plot to convey the evidence that $\theta_A < \theta_B$ to people of a variety
of prior opinions.
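
A minimal R sketch of this Monte Carlo calculation is given below. It uses the summary statistics stated in the exercise and the usual conjugate updates for the normal model ($\kappa_n = \kappa_0 + n$, $\nu_n = \nu_0 + n$, $\mu_n = (\kappa_0\mu_0 + n\bar{y})/\kappa_n$, and $\nu_n\sigma_n^2 = \nu_0\sigma_0^2 + (n-1)s^2 + \kappa_0 n(\bar{y}-\mu_0)^2/\kappa_n$). The function name post_theta and the choice of 10,000 draws are assumptions made for illustration.

```r
set.seed(1)

# Draw S values of theta from its marginal posterior under the conjugate prior
post_theta <- function(S, ybar, s, n, mu0, s20, k0, nu0) {
  kn  <- k0 + n
  nun <- nu0 + n
  mun <- (k0 * mu0 + n * ybar) / kn
  s2n <- (nu0 * s20 + (n - 1) * s^2 + k0 * n * (ybar - mu0)^2 / kn) / nun
  sigma2 <- 1 / rgamma(S, shape = nun / 2, rate = nun * s2n / 2)
  rnorm(S, mean = mun, sd = sqrt(sigma2 / kn))
}

k_vals <- c(1, 2, 4, 8, 16, 32)
pr <- sapply(k_vals, function(k) {
  th_a <- post_theta(10000, 75.2, 7.3, 16, 75, 100, k, k)
  th_b <- post_theta(10000, 77.5, 8.1, 16, 75, 100, k, k)
  mean(th_a < th_b)   # Monte Carlo estimate of Pr(theta_A < theta_B | data)
})

plot(k_vals, pr, type = "b", log = "x",
     xlab = "kappa0 = nu0", ylab = "Pr(theta_A < theta_B | y_A, y_B)")
```

Because the two groups have independent priors and independent data, the draws for $\theta_A$ and $\theta_B$ can be generated separately and compared elementwise.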

5.3 Marginal distributions: Given observations $Y_1, \ldots, Y_n \sim$ i.i.d. normal$(\theta, \sigma^2)$
and using the conjugate prior distribution for $\theta$ and $\sigma^2$, derive the formula
for $p(\theta \mid y_1, \ldots, y_n)$, the marginal posterior distribution of $\theta$, conditional
on the data but marginal over $\sigma^2$. Check your work by comparing your
formula to a Monte Carlo estimate of the marginal distribution, using
some values of $y_1, \ldots, y_n$, $\mu_0$, $\sigma_0^2$, $\nu_0$ and $\kappa_0$ that you choose. Also derive
$p(\tilde{\sigma}^2 \mid y_1, \ldots, y_n)$, where $\tilde{\sigma}^2 = 1/\sigma^2$ is the precision.

5.4 Jeffreys’ prior: For sampling models expressed in terms of a p-dimensional
vector $\psi$, Jeffreys’ prior (Exercise 3.11) is defined as $p_J(\psi) \propto \sqrt{|I(\psi)|}$,
where $|I(\psi)|$ is the determinant of the $p \times p$ matrix $I(\psi)$ having entries
$I(\psi)_{k,l} = -\mathrm{E}[\partial^2 \log p(Y \mid \psi) / \partial\psi_k \partial\psi_l]$.

a) Show that Jeffreys’ prior for the normal model is $p_J(\theta, \sigma^2) \propto (\sigma^2)^{-3/2}$.

b) Let $y = (y_1, \ldots, y_n)$ be the observed values of an i.i.d. sample from a
normal$(\theta, \sigma^2)$ population. Find a probability density $p_J(\theta, \sigma^2 \mid y)$ such
that $p_J(\theta, \sigma^2 \mid y) \propto p_J(\theta, \sigma^2)\, p(y \mid \theta, \sigma^2)$. It may be convenient to write
this joint density as $p_J(\theta \mid \sigma^2, y) \times p_J(\sigma^2 \mid y)$. Can this joint density be
considered a posterior density?

5.5 Unit information prior: Obtain a unit information prior for the normal
model as follows:

a) Reparameterize the normal model as $p(y \mid \theta, \psi)$, where $\psi = 1/\sigma^2$. Write
out the log likelihood $l(\theta, \psi \mid y) = \sum_i \log p(y_i \mid \theta, \psi)$ in terms of $\theta$ and $\psi$.

b) Find a probability density $p_U(\theta, \psi)$ so that $\log p_U(\theta, \psi) = l(\theta, \psi \mid y)/n + c$,
where $c$ is a constant that does not depend on $\theta$ or $\psi$. Hint: Write
$\sum_i (y_i - \theta)^2$ as $\sum_i (y_i - \bar{y} + \bar{y} - \theta)^2 = \sum_i (y_i - \bar{y})^2 + n(\theta - \bar{y})^2$, and recall
that $\log p_U(\theta, \psi) = \log p_U(\theta \mid \psi) + \log p_U(\psi)$.

c) Find a probability density $p_U(\theta, \psi \mid y)$ that is proportional to $p_U(\theta, \psi) \times
p(y_1, \ldots, y_n \mid \theta, \psi)$. It may be convenient to write this joint density as
$p_U(\theta \mid \psi, y) \times p_U(\psi \mid y)$. Can this joint density be considered a posterior
density?


Chapter 6


6.1 Poisson population comparisons: Let’s reconsider the number of children
data of Exercise 4.8. We’ll assume Poisson sampling models for the two
groups as before, but now we’ll parameterize $\theta_A$ and $\theta_B$ as $\theta_A = \theta$,
$\theta_B = \theta \times \gamma$. In this parameterization, $\gamma$ represents the relative rate $\theta_B/\theta_A$. Let
$\theta \sim$ gamma$(a_\theta, b_\theta)$ and let $\gamma \sim$ gamma$(a_\gamma, b_\gamma)$.

a) Are $\theta_A$ and $\theta_B$ independent or dependent under this prior distribution?
In what situations is such a joint prior distribution justified?

b) Obtain the form of the full conditional distribution of $\theta$ given $y_A$, $y_B$
and $\gamma$.

c) Obtain the form of the full conditional distribution of $\gamma$ given $y_A$, $y_B$
and $\theta$.

d) Set $a_\theta = 2$ and $b_\theta = 1$. Let $a_\gamma = b_\gamma \in \{8, 16, 32, 64, 128\}$. For each of
these five values, run a Gibbs sampler of at least 5,000 iterations and
obtain $\mathrm{E}[\theta_B - \theta_A \mid y_A, y_B]$. Describe the effects of the prior distribution
for $\gamma$ on the results.
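
One way the Gibbs sampler in part d) might be organized in R is sketched below. The data vectors y_a and y_b are simulated, hypothetical stand-ins for the two files, and the function name gibbs is made up. The two gamma updates are what follows from multiplying the Poisson likelihoods by the gamma priors; treat them as a sketch to check against your own derivations in b) and c), not as a reference solution.

```r
set.seed(1)

# Hypothetical stand-ins for the two samples of counts
y_a <- rpois(58, 1.0)
y_b <- rpois(218, 1.4)
n_a <- length(y_a)
n_b <- length(y_b)

gibbs <- function(S, a_th, b_th, a_g, b_g) {
  theta <- numeric(S)
  gam   <- numeric(S)
  g <- 1                                   # starting value for gamma
  for (s in seq_len(S)) {
    # theta | y_A, y_B, gamma
    th <- rgamma(1, shape = a_th + sum(y_a) + sum(y_b),
                 rate = b_th + n_a + n_b * g)
    # gamma | y_A, y_B, theta
    g <- rgamma(1, shape = a_g + sum(y_b), rate = b_g + n_b * th)
    theta[s] <- th
    gam[s]   <- g
  }
  list(theta = theta, gamma = gam)
}

ab_vals <- c(8, 16, 32, 64, 128)
diff_mean <- sapply(ab_vals, function(v) {
  out <- gibbs(5000, a_th = 2, b_th = 1, a_g = v, b_g = v)
  mean(out$theta * out$gamma - out$theta)  # estimate of E[theta_B - theta_A | data]
})
round(diff_mean, 3)
```

Since $a_\gamma = b_\gamma$ centers the prior for $\gamma$ at 1, larger values express a stronger prior belief that the two rates are equal.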

6.2 Mixture model: The file glucose.dat contains the plasma glucose
concentration of 532 females from a study on diabetes (see Exercise 7.6).

a) Make a histogram or kernel density estimate of the data. Describe
how this empirical distribution deviates from the shape of a normal
distribution.

b) Consider the following mixture model for these data: For each study
participant there is an unobserved group membership variable $X_i$
which is equal to 1 or 2 with probability $p$ and $1 - p$. If $X_i = 1$
then $Y_i \sim$ normal$(\theta_1, \sigma_1^2)$, and if $X_i = 2$ then $Y_i \sim$ normal$(\theta_2, \sigma_2^2)$. Let
$p \sim$ beta$(a, b)$, $\theta_j \sim$ normal$(\mu_0, \tau_0^2)$ and $1/\sigma_j^2 \sim$ gamma$(\nu_0/2, \nu_0\sigma_0^2/2)$
for both $j = 1$ and $j = 2$. Obtain the full conditional distributions of
$(X_1, \ldots, X_n)$, $p$, $\theta_1$, $\theta_2$, $\sigma_1^2$ and $\sigma_2^2$.

c) Setting $a = b = 1$, $\mu_0 = 120$, $\tau_0^2 = 200$, $\sigma_0^2 = 1000$ and $\nu_0 = 10$,
implement the Gibbs sampler for at least 10,000 iterations. Let
$\theta_{(1)}^{(s)} = \min\{\theta_1^{(s)}, \theta_2^{(s)}\}$ and $\theta_{(2)}^{(s)} = \max\{\theta_1^{(s)}, \theta_2^{(s)}\}$. Compute and plot
the autocorrelation functions of $\theta_{(1)}^{(s)}$ and $\theta_{(2)}^{(s)}$, as well as their effective
sample sizes.

d) For each iteration $s$ of the Gibbs sampler, sample a value $x \sim$
binary$(p^{(s)})$, then sample $\tilde{Y}^{(s)} \sim$ normal$(\theta_x^{(s)}, \sigma_x^{2(s)})$. Plot a histogram
or kernel density estimate for the empirical distribution of
$\tilde{Y}^{(1)}, \ldots, \tilde{Y}^{(S)}$, and compare to the distribution in part a). Discuss
the adequacy of this two-component mixture model for the glucose
data.

6.3 Probit regression: A panel study followed 25 married couples over a
period of five years. One item of interest is the relationship between divorce
rates and the various characteristics of the couples. For example, the
researchers would like to model the probability of divorce as a function of
age differential, recorded as the man’s age minus the woman’s age. The
data can be found in the file divorce.dat. We will model these data with
probit regression, in which a binary variable $Y_i$ is described in terms of
an explanatory variable $x_i$ via the following latent variable model:


$Z_i = \beta x_i + \epsilon_i$
$Y_i = \delta_{(c,\infty)}(Z_i)$,


where $\beta$ and $c$ are unknown coefficients, $\epsilon_1, \ldots, \epsilon_n \sim$ i.i.d. normal(0, 1)
and $\delta_{(c,\infty)}(z) = 1$ if $z > c$ and equals zero otherwise.

a) Assuming $\beta \sim$ normal$(0, \tau_\beta^2)$, obtain the full conditional distribution
$p(\beta \mid y, x, z, c)$.

b) Assuming $c \sim$ normal$(0, \tau_c^2)$, show that $p(c \mid y, x, z, \beta)$ is a constrained
normal density, i.e. proportional to a normal density but constrained
to lie in an interval. Similarly, show that $p(z_i \mid y, x, z_{-i}, \beta, c)$ is
proportional to a normal density but constrained to be either above $c$ or
below $c$, depending on $y_i$.

c) Letting $\tau_\beta^2 = \tau_c^2 = 16$, implement a Gibbs sampling scheme that
approximates the joint posterior distribution of $Z$, $\beta$ and $c$ (a method
for sampling from constrained normal distributions is outlined in
Section 12.1.1). Run the Gibbs sampler long enough so that the effective
sample sizes of all unknown parameters are greater than 1,000
(including the $Z_i$’s). Compute the autocorrelation function of the parameters
and discuss the mixing of the Markov chain.

d) Obtain a 95% posterior confidence interval for $\beta$, as well as
$\Pr(\beta > 0 \mid y, x)$.


Chapter 7 


7.1 Jeffreys’ prior: For the multivariate normal model, Jeffreys’ rule for
generating a prior distribution on $(\theta, \Sigma)$ gives $p_J(\theta, \Sigma) \propto |\Sigma|^{-(p+2)/2}$.

a) Explain why the function $p_J$ cannot actually be a probability density
for $(\theta, \Sigma)$.

b) Let $p_J(\theta, \Sigma \mid y_1, \ldots, y_n)$ be the probability density that is proportional
to $p_J(\theta, \Sigma) \times p(y_1, \ldots, y_n \mid \theta, \Sigma)$. Obtain the form of $p_J(\theta, \Sigma \mid y_1, \ldots, y_n)$,
$p_J(\theta \mid \Sigma, y_1, \ldots, y_n)$ and $p_J(\Sigma \mid y_1, \ldots, y_n)$.

7.2 Unit information prior: Letting $\Psi = \Sigma^{-1}$, show that a unit information
prior for $(\theta, \Psi)$ is given by $\theta \mid \Psi \sim$ multivariate normal$(\bar{y}, \Psi^{-1})$ and $\Psi \sim$
Wishart$(p + 1, S^{-1})$, where $S = \sum_i (y_i - \bar{y})(y_i - \bar{y})^T / n$. This can be done
by mimicking the procedure outlined in Exercise 5.6 as follows:

a) Reparameterize the multivariate normal model in terms of the
precision matrix $\Psi = \Sigma^{-1}$. Write out the resulting log likelihood,
and find a probability density $p_U(\theta, \Psi) = p_U(\theta \mid \Psi)\, p_U(\Psi)$ such that
$\log p_U(\theta, \Psi) = l(\theta, \Psi \mid Y)/n + c$, where $c$ does not depend on $\theta$ or $\Psi$.
Hint: Write $(y_i - \theta)$ as $(y_i - \bar{y} + \bar{y} - \theta)$, and note that $\sum_i a_i^T B a_i$ can
be written as $\mathrm{tr}(AB)$, where $A = \sum_i a_i a_i^T$.

b) Let $p_U(\Sigma)$ be the inverse-Wishart density induced by $p_U(\Psi)$. Obtain a
density $p_U(\theta, \Sigma \mid y_1, \ldots, y_n) \propto p_U(\theta \mid \Sigma)\, p_U(\Sigma)\, p(y_1, \ldots, y_n \mid \theta, \Sigma)$. Can
this be interpreted as a posterior distribution for $\theta$ and $\Sigma$?


7.3 Australian crab data: The files bluecrab.dat and orangecrab.dat
contain measurements of body depth ($Y_1$) and rear width ($Y_2$), in millimeters,
made on 50 male crabs from each of two species, blue and orange. We will
model these data using a bivariate normal distribution.

a) For each of the two species, obtain posterior distributions of the
population mean $\theta$ and covariance matrix $\Sigma$ as follows: Using the
semiconjugate prior distributions for $\theta$ and $\Sigma$, set $\mu_0$ equal to the sample
mean of the data, $\Lambda_0$ and $S_0$ equal to the sample covariance matrix
and $\nu_0 = 4$. Obtain 10,000 posterior samples of $\theta$ and $\Sigma$. Note that
this “prior” distribution loosely centers the parameters around
empirical estimates based on the observed data (and is very similar to the
unit information prior described in the previous exercise). It cannot
be considered as our true prior distribution, as it was derived from
the observed data. However, it can be roughly considered as the prior
distribution of someone with weak but unbiased information.

b) Plot values of $\theta = (\theta_1, \theta_2)^T$ for each group and compare. Describe any
size differences between the two groups.

c) From each covariance matrix obtained from the Gibbs sampler,
obtain the corresponding correlation coefficient. From these values, plot
posterior densities of the correlations $\rho_{blue}$ and $\rho_{orange}$ for the two
groups. Evaluate differences between the two species by comparing
these posterior distributions. In particular, obtain an approximation
to $\Pr(\rho_{blue} < \rho_{orange} \mid y_{blue}, y_{orange})$. What do the results suggest about
differences between the two populations?


7.4 Marriage data: The file agehw.dat contains data on the ages of 100
married couples sampled from the U.S. population.

a) Before you look at the data, use your own knowledge to formulate a
semiconjugate prior distribution for $\theta = (\theta_h, \theta_w)^T$ and $\Sigma$, where $\theta_h$, $\theta_w$
are mean husband and wife ages, and $\Sigma$ is the covariance matrix.

b) Generate a prior predictive dataset of size $n = 100$, by sampling $(\theta, \Sigma)$
from your prior distribution and then simulating $Y_1, \ldots, Y_n \sim$ i.i.d.
multivariate normal$(\theta, \Sigma)$. Generate several such datasets, make
bivariate scatterplots for each dataset, and make sure they roughly
represent your prior beliefs about what such a dataset would actually
look like. If your prior predictive datasets do not conform to your
beliefs, go back to part a) and formulate a new prior. Report the prior
that you eventually decide upon, and provide scatterplots for at least
three prior predictive datasets.

c) Using your prior distribution and the 100 values in the dataset,
obtain an MCMC approximation to $p(\theta, \Sigma \mid y_1, \ldots, y_{100})$. Plot the joint
posterior distribution of $\theta_h$ and $\theta_w$, and also the marginal posterior
density of the correlation between $Y_h$ and $Y_w$, the ages of a husband
and wife. Obtain 95% posterior confidence intervals for $\theta_h$, $\theta_w$ and the
correlation coefficient.

d) Obtain 95% posterior confidence intervals for $\theta_h$, $\theta_w$ and the
correlation coefficient using the following prior distributions:

i. Jeffreys’ prior, described in Exercise 7.1;
ii. the unit information prior, described in Exercise 7.2;
iii. a “diffuse prior” with $\mu_0 = 0$, $\Lambda_0 = 10^5 \times I$, $S_0 = 1000 \times I$ and
$\nu_0 = 3$.

e) Compare the confidence intervals from d) to those obtained in c).
Discuss whether or not you think that your prior information is helpful
in estimating $\theta$ and $\Sigma$, or if you think one of the alternatives in d)
is preferable. What about if the sample size were much smaller, say
$n = 25$?


7.5 Imputation: The file interexp.dat contains data from an experiment that
was interrupted before all the data could be gathered. Of interest was the
difference in reaction times of experimental subjects when they were given
stimulus A versus stimulus B. Each subject is tested under one of the two
stimuli on their first day of participation in the study, and is tested under
the other stimulus at some later date. Unfortunately the experiment was
interrupted before it was finished, leaving the researchers with 26 subjects
with both A and B responses, 15 subjects with only A responses and 17
subjects with only B responses.

a) Calculate empirical estimates of $\theta_A$, $\theta_B$, $\rho$, $\sigma_A^2$, $\sigma_B^2$ from the data using
the commands mean, cor and var. Use all the A responses to get
$\hat{\theta}_A$ and $\hat{\sigma}_A^2$, and use all the B responses to get $\hat{\theta}_B$ and $\hat{\sigma}_B^2$. Use only
the complete data cases to get $\hat{\rho}$.

b) For each person $i$ with only an A response, impute a B response as

$\hat{y}_{i,B} = \hat{\theta}_B + (y_{i,A} - \hat{\theta}_A)\, \hat{\rho} \sqrt{\hat{\sigma}_B^2 / \hat{\sigma}_A^2}$.

For each person $i$ with only a B response, impute an A response as

$\hat{y}_{i,A} = \hat{\theta}_A + (y_{i,B} - \hat{\theta}_B)\, \hat{\rho} \sqrt{\hat{\sigma}_A^2 / \hat{\sigma}_B^2}$.

You now have two “observations” for each individual. Do a paired
sample t-test and obtain a 95% confidence interval for $\theta_A - \theta_B$.

c) Using either Jeffreys’ prior or a unit information prior distribution for
the parameters, implement a Gibbs sampler that approximates the
joint distribution of the parameters and the missing data. Compute
a posterior mean for $\theta_A - \theta_B$ as well as a 95% posterior confidence
interval for $\theta_A - \theta_B$. Compare these results with the results from b)
and discuss.
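
The plug-in imputation of parts a) and b) can be sketched in R as follows. The matrix y is a simulated, hypothetical stand-in for interexp.dat (complete rows first, then the A-only rows, then the B-only rows, with NA marking a missing response); the column names and generating values are assumptions made only for this illustration.

```r
set.seed(1)

# Hypothetical stand-in for the data: 26 complete, 15 A-only, 17 B-only
n   <- 58
y_a <- rnorm(n, 25, 4)
y_b <- 24 + 0.6 * (y_a - 25) + rnorm(n, 0, 3)
y   <- cbind(yA = y_a, yB = y_b)
y[27:41, "yB"] <- NA
y[42:58, "yA"] <- NA

# Part a): empirical estimates from the available responses
th_a <- mean(y[, "yA"], na.rm = TRUE)
th_b <- mean(y[, "yB"], na.rm = TRUE)
s2_a <- var(y[, "yA"], na.rm = TRUE)
s2_b <- var(y[, "yB"], na.rm = TRUE)
rho  <- cor(y[, "yA"], y[, "yB"], use = "complete.obs")

# Part b): regression-style imputation of the missing responses
only_a <- is.na(y[, "yB"])
only_b <- is.na(y[, "yA"])
y_imp  <- y
y_imp[only_a, "yB"] <- th_b + (y[only_a, "yA"] - th_a) * rho * sqrt(s2_b / s2_a)
y_imp[only_b, "yA"] <- th_a + (y[only_b, "yB"] - th_b) * rho * sqrt(s2_a / s2_b)

# Paired t-test on the completed data
tt <- t.test(y_imp[, "yA"], y_imp[, "yB"], paired = TRUE)
tt$conf.int
```

Note that this interval treats the imputed values as if they had been observed, which is one reason to compare it with the Gibbs sampler in part c).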

7.6 Diabetes data: A population of 532 women living near Phoenix,
Arizona were tested for diabetes. Other information was gathered from these
women at the time of testing, including number of pregnancies, glucose
level, blood pressure, skin fold thickness, body mass index, diabetes
pedigree and age. This information appears in the file azdiabetes.dat. Model
the joint distribution of these variables for the diabetics and non-diabetics
separately, using a multivariate normal distribution:

a) For both groups separately, use the following type of unit information
prior, where $\hat{\Sigma}$ is the sample covariance matrix.

i. $\mu_0 = \bar{y}$, $\Lambda_0 = \hat{\Sigma}$;
ii. $S_0 = \hat{\Sigma}$, $\nu_0 = p + 2 = 9$.

Generate at least 10,000 Monte Carlo samples for $\{\theta_d, \Sigma_d\}$ and
$\{\theta_n, \Sigma_n\}$, the model parameters for diabetics and non-diabetics
respectively. For each of the seven variables $j \in \{1, \ldots, 7\}$, compare the
marginal posterior distributions of $\theta_{d,j}$ and $\theta_{n,j}$. Which variables seem
to differ between the two groups? Also obtain $\Pr(\theta_{d,j} > \theta_{n,j} \mid Y)$ for
each $j \in \{1, \ldots, 7\}$.

b) Obtain the posterior means of $\Sigma_d$ and $\Sigma_n$, and plot the entries versus
each other. What are the main differences, if any?


Chapter 8 
8.1 Components of variance: Consider the hierarchical model where

$\theta_1, \ldots, \theta_m \mid \mu, \tau^2 \sim$ i.i.d. normal$(\mu, \tau^2)$
$y_{1,j}, \ldots, y_{n_j,j} \mid \theta_j, \sigma^2 \sim$ i.i.d. normal$(\theta_j, \sigma^2)$.

For this problem, we will eventually compute the following:

$\mathrm{Var}[y_{i,j} \mid \theta_j, \sigma^2]$, $\mathrm{Var}[\bar{y}_{\cdot,j} \mid \theta_j, \sigma^2]$, $\mathrm{Cov}[y_{i_1,j}, y_{i_2,j} \mid \theta_j, \sigma^2]$
$\mathrm{Var}[y_{i,j} \mid \mu, \tau^2]$, $\mathrm{Var}[\bar{y}_{\cdot,j} \mid \mu, \tau^2]$, $\mathrm{Cov}[y_{i_1,j}, y_{i_2,j} \mid \mu, \tau^2]$

First, let’s use our intuition to guess at the answers:

a) Which do you think is bigger, $\mathrm{Var}[y_{i,j} \mid \theta_j, \sigma^2]$ or $\mathrm{Var}[y_{i,j} \mid \mu, \tau^2]$? To
guide your intuition, you can interpret the first as the variability of
the Y’s when sampling from a fixed group, and the second as the
variability in first sampling a group, then sampling a unit from within
the group.

b) Do you think $\mathrm{Cov}[y_{i_1,j}, y_{i_2,j} \mid \theta_j, \sigma^2]$ is negative, positive, or zero?
Answer the same for $\mathrm{Cov}[y_{i_1,j}, y_{i_2,j} \mid \mu, \tau^2]$. You may want to think about
what $y_{i_1,j}$ tells you about $y_{i_2,j}$ if $\theta_j$ is known, and what it tells you
when $\theta_j$ is unknown.

c) Now compute each of the six quantities above and compare to your
answers in a) and b).

d) Now assume we have a prior $p(\mu)$ for $\mu$. Using Bayes’ rule, show that

$p(\mu \mid \theta_1, \ldots, \theta_m, \sigma^2, \tau^2, y_1, \ldots, y_m) = p(\mu \mid \theta_1, \ldots, \theta_m, \tau^2)$.

Interpret in words what this means.
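
Before doing the algebra in part c), the variances and covariances can be checked numerically. The R sketch below simulates many pairs of units from the same group, once with the group mean drawn afresh each time and once with it held fixed. The values of mu, tau2, sigma2 and theta_fixed are hypothetical choices for illustration only.

```r
set.seed(1)

# Hypothetical parameter values
mu <- 0; tau2 <- 4; sigma2 <- 1
S  <- 100000

# Marginal case: a new group mean for every replicate, two units per group
theta <- rnorm(S, mu, sqrt(tau2))
y1 <- rnorm(S, theta, sqrt(sigma2))
y2 <- rnorm(S, theta, sqrt(sigma2))
marginal <- c(var = var(y1), cov = cov(y1, y2))

# Conditional case: the group mean is held fixed
theta_fixed <- 1
z1 <- rnorm(S, theta_fixed, sqrt(sigma2))
z2 <- rnorm(S, theta_fixed, sqrt(sigma2))
conditional <- c(var = var(z1), cov = cov(z1, z2))

rbind(marginal, conditional)
```

Comparing the two rows of the output with your guesses from a) and b) is a quick sanity check on the derivations.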

8.2 Sensitivity analysis: In this exercise we will revisit the study from Exercise
5.2, in which 32 students in a science classroom were randomly assigned
to one of two study methods, A and B, with $n_A = n_B = 16$. After several
weeks of study, students were examined on the course material, and the
scores are summarized by $\{\bar{y}_A = 75.2, s_A = 7.3\}$, $\{\bar{y}_B = 77.5, s_B = 8.1\}$.
We will estimate $\theta_A = \mu + \delta$ and $\theta_B = \mu - \delta$ using the two-sample model
and prior distributions of Section 8.1.

a) Let $\mu \sim$ normal(75, 100), $1/\sigma^2 \sim$ gamma(1, 100) and $\delta \sim$ normal$(\delta_0, \tau_0^2)$.
For each combination of $\delta_0 \in \{-4, -2, 0, 2, 4\}$ and $\tau_0^2 \in \{10, 50, 100, 500\}$,
obtain the posterior distribution of $\mu$, $\delta$ and $\sigma^2$ and compute

i. $\Pr(\delta < 0 \mid Y)$;
ii. a 95% posterior confidence interval for $\delta$;
iii. the prior and posterior correlation of $\theta_A$ and $\theta_B$.

b) Describe how you might use these results to convey evidence that
$\theta_A < \theta_B$ to people of a variety of prior opinions.

8.3 Hierarchical modeling: The files school1.dat through school8.dat give
weekly hours spent on homework for students sampled from eight different
schools. Obtain posterior distributions for the true means for the eight
different schools using a hierarchical normal model with the following
prior parameters:


$\mu_0 = 7$, $\gamma_0^2 = 5$, $\tau_0^2 = 10$, $\eta_0 = 2$, $\sigma_0^2 = 15$, $\nu_0 = 2$.


a) Run a Gibbs sampling algorithm to approximate the posterior
distribution of $\{\theta, \sigma^2, \mu, \tau^2\}$. Assess the convergence of the Markov chain,
and find the effective sample size for $\{\sigma^2, \mu, \tau^2\}$. Run the chain long
enough so that the effective sample sizes are all above 1,000.

b) Compute posterior means and 95% confidence regions for $\{\sigma^2, \mu, \tau^2\}$.
Also, compare the posterior densities to the prior densities, and discuss
what was learned from the data.

c) Plot the posterior density of $R = \tau^2 / (\sigma^2 + \tau^2)$ and compare it to a plot of the
prior density of $R$. Describe the evidence for between-school variation.

d) Obtain the posterior probability that $\theta_7$ is smaller than $\theta_6$, as well as
the posterior probability that $\theta_7$ is the smallest of all the $\theta$’s.

e) Plot the sample averages $\bar{y}_1, \ldots, \bar{y}_8$ against the posterior expectations
of $\theta_1, \ldots, \theta_8$, and describe the relationship. Also compute the sample
mean of all observations and compare it to the posterior mean of $\mu$.


Chapter 9 


9.1 Extrapolation: The file swim.dat contains data on the amount of time,
in seconds, it takes each of four high school swimmers to swim 50 yards.
Each swimmer has six times, taken on a biweekly basis.

a) Perform the following data analysis for each swimmer separately:

i. Fit a linear regression model of swimming time as the response and
week as the explanatory variable. To formulate your prior, use the
information that competitive times for this age group generally
range from 22 to 24 seconds.

ii. For each swimmer $j$, obtain a posterior predictive distribution
for $Y_j^*$, their time if they were to swim two weeks from the last
recorded time.

b) The coach of the team has to decide which of the four swimmers will
compete in a swimming meet in two weeks. Using your predictive
distributions, compute $\Pr(Y_j^* = \max\{Y_1^*, \ldots, Y_4^*\} \mid Y)$ for each swimmer
$j$, and based on this make a recommendation to the coach.

9.2 Model selection: As described in Example 6 of Chapter 7, the file
azdiabetes.dat contains data on health-related variables of a population
of 532 women. In this exercise we will be modeling the conditional
distribution of glucose level (glu) as a linear combination of the other
variables, excluding the variable diabetes.

a) Fit a regression model using the g-prior with $g = n$, $\nu_0 = 2$ and $\sigma_0^2 = 1$.
Obtain posterior confidence intervals for all of the parameters.

b) Perform the model selection and averaging procedure described in
Section 9.3. Obtain $\Pr(\beta_j \neq 0 \mid y)$, as well as posterior confidence intervals
for all of the parameters. Compare to the results in part a).

9.3 Crime: The file crime.dat contains crime rates and data on 15
explanatory variables for 47 U.S. states, in which both the crime rates
and the explanatory variables have been centered and scaled to have
variance 1. A description of the variables can be obtained by typing
library(MASS); ?UScrime in R.

a) Fit a regression model $y = X\beta + \epsilon$ using the g-prior with $g = n$, $\nu_0 = 2$
and $\sigma_0^2 = 1$. Obtain marginal posterior means and 95% confidence
intervals for $\beta$, and compare to the least squares estimates. Describe
the relationships between crime and the explanatory variables. Which
variables seem strongly predictive of crime rates?

b) Let’s see how well regression models can predict crime rates based on
the X-variables. Randomly divide the crime data roughly in half, into a
training set $\{y_{tr}, X_{tr}\}$ and a test set $\{y_{te}, X_{te}\}$.

i. Using only the training set, obtain least squares regression
coefficients $\hat{\beta}_{ols}$. Obtain predicted values for the test data by computing
$\hat{y}_{ols} = X_{te}\hat{\beta}_{ols}$. Plot $\hat{y}_{ols}$ versus $y_{te}$ and compute the prediction
error $\frac{1}{n_{te}} \sum_i (y_{i,te} - \hat{y}_{i,ols})^2$.

ii. Now obtain the posterior mean $\hat{\beta}_{Bayes} = \mathrm{E}[\beta \mid y_{tr}]$ using the g-prior
described above and the training data only. Obtain predictions
for the test set $\hat{y}_{Bayes} = X_{te}\hat{\beta}_{Bayes}$. Plot versus the test data,
compute the prediction error, and compare to the OLS prediction
error. Explain the results.

c) Repeat the procedures in b) many times with different randomly
generated test and training sets. Compute the average prediction error
for both the OLS and Bayesian methods.
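
The repeated train/test comparison in parts b) and c) might be organized as in the R sketch below. The matrices X and y are simulated, hypothetical stand-ins for the crime data, and pred_err is a made-up helper name. The sketch uses the fact that, under the g-prior, the posterior mean of $\beta$ is $g/(g+1)$ times the least squares estimate, and it sets $g$ to the training sample size as an assumption.

```r
set.seed(1)

# Hypothetical stand-in for the crime data: centered and scaled
n <- 47; p <- 15
X <- scale(matrix(rnorm(n * p), n, p))
beta_true <- c(1, -0.5, 0.5, rep(0, p - 3))
y <- as.vector(scale(X %*% beta_true + rnorm(n)))

# One random split: returns the test-set prediction error for both methods
pred_err <- function() {
  tr   <- sample(n, floor(n / 2))
  X_tr <- X[tr, ];  y_tr <- y[tr]
  X_te <- X[-tr, ]; y_te <- y[-tr]

  b_ols   <- solve(t(X_tr) %*% X_tr, t(X_tr) %*% y_tr)  # least squares
  g       <- length(y_tr)
  b_bayes <- g / (g + 1) * b_ols                         # g-prior posterior mean

  c(ols   = mean((y_te - X_te %*% b_ols)^2),
    bayes = mean((y_te - X_te %*% b_bayes)^2))
}

err <- replicate(200, pred_err())
rowMeans(err)   # average prediction error for each method
```

With only about 23 training rows and 15 predictors, the least squares coefficients are quite variable, so even the mild shrinkage from the g-prior can change the test error.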


Chapter 10 


10.1 Reflecting random walks: It is often useful in MCMC to have a proposal
distribution which is both symmetric and has support only on a certain
region. For example, if we know $\theta > 0$, we would like our proposal
distribution $J(\theta_1 \mid \theta_0)$ to have support on positive $\theta$ values. Consider the following
proposal algorithm:

• sample $\tilde{\theta} \sim$ uniform$(\theta_0 - \delta, \theta_0 + \delta)$;
• if $\tilde{\theta} < 0$, set $\theta_1 = -\tilde{\theta}$;
• if $\tilde{\theta} > 0$, set $\theta_1 = \tilde{\theta}$.

In other words, $\theta_1 = |\tilde{\theta}|$. Show that the above algorithm draws samples
from a symmetric proposal distribution which has support on positive
values of $\theta$. It may be helpful to write out the associated proposal density
$J(\theta_1 \mid \theta_0)$ under the two conditions $\theta_0 < \delta$ and $\theta_0 > \delta$ separately.
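
The proposal can be sketched in R and its symmetry checked numerically at a few points. The function names reflect_propose and J are made up for this illustration; J adds the uniform density at $\theta_1$ and at its mirror image $-\theta_1$, which is one way to write the density implied by the reflection, and should be checked against your own derivation.

```r
set.seed(1)

# Draw a proposal: uniform step around theta0, reflected at zero
reflect_propose <- function(theta0, delta) {
  abs(runif(1, theta0 - delta, theta0 + delta))
}

# Density of the reflected proposal at theta1, given the current value theta0
J <- function(theta1, theta0, delta) {
  (dunif(theta1, theta0 - delta, theta0 + delta) +
     dunif(-theta1, theta0 - delta, theta0 + delta)) * (theta1 >= 0)
}

draws <- replicate(10000, reflect_propose(0.3, 1))
range(draws)                          # all proposals are nonnegative

# Symmetry check: J(a | b) should equal J(b | a)
c(J(0.2, 0.9, 1),  J(0.9, 0.2, 1))
c(J(0.05, 0.5, 1), J(0.5, 0.05, 1))
```

A numerical check like this does not replace the proof, but it can catch mistakes in the piecewise density before it is used in a Metropolis acceptance ratio.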

10.2 Nesting success: Younger male sparrows may or may not nest during a
mating season, perhaps depending on their physical characteristics.
Researchers have recorded the nesting success of 43 young male sparrows
of the same age, as well as their wingspan, and the data appear in the
file msparrownest.dat. Let $Y_i$ be the binary indicator that sparrow $i$
successfully nests, and let $x_i$ denote their wingspan. Our model for $Y_i$ is
logit $\Pr(Y_i = 1 \mid \alpha, \beta, x_i) = \alpha + \beta x_i$, where the logit function is given by
logit $\theta = \log[\theta/(1 - \theta)]$.

a) Write out the joint sampling distribution $\prod_{i=1}^n p(y_i \mid \alpha, \beta, x_i)$ and
simplify as much as possible.

b) Formulate a prior probability distribution over $\alpha$ and $\beta$ by
considering the range of $\Pr(Y = 1 \mid \alpha, \beta, x)$ as $x$ ranges over 10 to 15, the
approximate range of the observed wingspans.

c) Implement a Metropolis algorithm that approximates $p(\alpha, \beta \mid y, x)$.
Adjust the proposal distribution to achieve a reasonable acceptance
rate, and run the algorithm long enough so that the effective sample
size is at least 1,000 for each parameter.

d) Compare the posterior densities of $\alpha$ and $\beta$ to their prior densities.

e) Using output from the Metropolis algorithm, come up with a way to
make a confidence band for the following function $f_{\alpha\beta}(x)$ of wingspan:


$f_{\alpha\beta}(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$,


where $\alpha$ and $\beta$ are the parameters in your sampling model. Make a
plot of such a band.
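
Parts c) and e) can be sketched together in R as below. The data are simulated, hypothetical stand-ins for msparrownest.dat, and the normal priors and proposal standard deviations are assumptions chosen only to make the sketch run; they are not a recommended answer to part b). The band is formed from pointwise posterior quantiles of $f_{\alpha\beta}(x)$ over a grid of wingspans.

```r
set.seed(1)

# Hypothetical stand-in for the sparrow data
n <- 43
x <- runif(n, 10, 15)
y <- rbinom(n, 1, plogis(-10 + 0.8 * x))

# Log posterior: Bernoulli log likelihood plus assumed normal priors
log_post <- function(ab) {
  eta <- ab[1] + ab[2] * x
  sum(y * eta - log1p(exp(eta))) +
    dnorm(ab[1], 0, 10, log = TRUE) + dnorm(ab[2], 0, 2, log = TRUE)
}

S   <- 20000
out <- matrix(NA, S, 2, dimnames = list(NULL, c("alpha", "beta")))
ab  <- c(0, 0)
acc <- 0
for (s in seq_len(S)) {
  prop <- ab + rnorm(2, 0, c(2, 0.15))     # symmetric random-walk proposal
  if (log(runif(1)) < log_post(prop) - log_post(ab)) {
    ab  <- prop
    acc <- acc + 1
  }
  out[s, ] <- ab
}
acc / S                                     # acceptance rate

# Pointwise 95% band for f(x) over a grid of wingspans
xg   <- seq(10, 15, length.out = 50)
f    <- plogis(out[, "alpha"] + outer(out[, "beta"], xg))
band <- apply(f, 2, quantile, probs = c(0.025, 0.5, 0.975))
matplot(xg, t(band), type = "l", lty = c(2, 1, 2), col = 1,
        xlab = "wingspan", ylab = "f(x)")
```

Because $\alpha$ and $\beta$ tend to be strongly correlated when $x$ is not centered, an independent random-walk proposal like this one may mix slowly, so the effective sample size should be checked before relying on the band.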


10.3 Tomato plants: The file tplant.dat contains data on the heights of ten
tomato plants, grown under a variety of soil pH conditions. Each plant
was measured twice. During the first measurement, each plant’s height
was recorded and a reading of soil pH was taken. During the second
measurement only plant height was measured, although it is assumed that pH
levels did not vary much from measurement to measurement.

a) Using ordinary least squares, fit a linear regression to the data,
modeling plant height as a function of time (measurement period) and pH
level. Interpret your model parameters.

b) Perform model diagnostics. In particular, carefully analyze the
residuals and comment on possible violations of assumptions. Also
assess (graphically or otherwise) whether or not the residuals within a
plant are independent. What parts of your ordinary linear regression
model do you think are sensitive to any violations of assumptions you
may have detected?

c) Hypothesize a new model for your data which allows for observations
within a plant to be correlated. Fit the model using an MCMC
approximation to the posterior distribution, and present diagnostics for your
approximation.

d) Discuss the results of your data analysis. In particular, discuss
similarities and differences between the ordinary linear regression and the
model fit with correlated responses. Are the conclusions different?

10.4 Gibbs sampling: Consider the general Gibbs sampler for a vector of
parameters $\phi$. Suppose $\phi^{(s)}$ is sampled from the target distribution $p(\phi)$
and then $\phi^{(s+1)}$ is generated using the Gibbs sampler by iteratively
updating each component of the parameter vector. Show that the marginal
probability $\Pr(\phi^{(s+1)} \in A)$ equals the target distribution $\int_A p(\phi)\, d\phi$.

10.5 Logistic regression variable selection: Consider a logistic regression model
for predicting diabetes as a function of $x_1$ = number of pregnancies, $x_2$ =
blood pressure, $x_3$ = body mass index, $x_4$ = diabetes pedigree and $x_5$ =
age. Using the data in azdiabetes.dat, center and scale each of the
x-variables by subtracting the sample average and dividing by the sample
standard deviation for each variable. Consider a logistic regression model
of the form $\Pr(Y_i = 1 \mid x_i, \beta, \gamma) = e^{\theta_i} / (1 + e^{\theta_i})$ where


$\theta_i = \beta_0 + \beta_1\gamma_1 x_{i,1} + \beta_2\gamma_2 x_{i,2} + \beta_3\gamma_3 x_{i,3} + \beta_4\gamma_4 x_{i,4} + \beta_5\gamma_5 x_{i,5}$.


In this model, each $\gamma_j$ is either 0 or 1, indicating whether or not variable
$j$ is a predictor of diabetes. For example, if it were the case that $\gamma =
(1, 1, 0, 0, 0)$, then $\theta_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2}$. Obtain posterior distributions
for $\beta$ and $\gamma$, using independent prior distributions for the parameters,
such that $\gamma_j \sim$ binary(1/2), $\beta_0 \sim$ normal(0, 16) and $\beta_j \sim$ normal(0, 4) for
each $j > 0$.

a) Implement a Metropolis-Hastings algorithm for approximating the
posterior distribution of $\beta$ and $\gamma$. Examine the sequences $\beta_j^{(s)}$ and
$\beta_j^{(s)} \times \gamma_j^{(s)}$ for each $j$ and discuss the mixing of the chain.

b) Approximate the posterior probability of the top five most frequently
occurring values of $\gamma$. How good do you think the MCMC estimates
of these posterior probabilities are?

c) For each $j$, plot posterior densities and obtain posterior means for
$\beta_j\gamma_j$. Also obtain $\Pr(\gamma_j = 1 \mid x, y)$.


Chapter 11 


11.1 Full conditionals: Derive formally the full conditional distributions of
$\theta$, $\Sigma$, $\sigma^2$ and the $\beta_j$’s as given in Section 11.2.

11.2 Randomized block design: Researchers interested in identifying the
optimal planting density for a type of perennial grass performed the following
randomized experiment: Ten different plots of land were each divided into
eight subplots, and planting densities of 2, 4, 6 and 8 plants per square
meter were randomly assigned to the subplots, so that there are two subplots
at each density in each plot. At the end of the growing season the amount
of plant matter yield was recorded in metric tons per hectare. These data
appear in the file pdensity.dat. The researchers want to fit a model like
$y = \beta_1 + \beta_2 x + \beta_3 x^2 + \epsilon$, where $y$ is yield and $x$ is planting density, but
worry that since soil conditions vary across plots they should allow for
some across-plot heterogeneity in this relationship. To accommodate this
possibility we will analyze these data using the hierarchical linear model
described in Section 11.1.

a) Before we do a Bayesian analysis we will get some ad hoc estimates
of these parameters via least squares regression. Fit the model $y =
\beta_1 + \beta_2 x + \beta_3 x^2 + \epsilon$ using OLS for each group, and make a plot showing
the heterogeneity of the least squares regression lines. From the least
squares coefficients find ad hoc estimates of $\theta$ and $\Sigma$. Also obtain an
estimate of $\sigma^2$ by combining the information from the residuals across
the groups.

b) Now we will perform an analysis of the data using the following
distributions as prior distributions:


$\Sigma^{-1} \sim$ Wishart$(4, \hat{\Sigma}^{-1})$
$\theta \sim$ multivariate normal$(\hat{\theta}, \hat{\Sigma})$
$\sigma^2 \sim$ inverse-gamma$(1, \hat{\sigma}^2)$


where $\hat{\theta}$, $\hat{\Sigma}$, $\hat{\sigma}^2$ are the estimates you obtained in a). Note that this
analysis is not combining prior information with information from
the data, as the “prior” distribution is based on the observed data.
However, such an analysis can be roughly interpreted as the Bayesian
analysis of an individual who has weak but unbiased prior information.

c) Use a Gibbs sampler to approximate posterior expectations of $\beta_j$ for
each group $j$, and plot the resulting regression lines. Compare to the
regression lines in a) above and describe why you see any differences
between the two sets of regression lines.

d) From your posterior samples, plot marginal posterior and prior
densities of $\theta$ and the elements of $\Sigma$. Discuss the evidence that the slopes
or intercepts vary across groups.

e) Suppose we want to identify the planting density that maximizes
average yield over a random sample of plots. Find the value $x_{max}$ of $x$
that maximizes expected yield, and provide a 95% posterior predictive
interval for the yield of a randomly sampled plot having planting
density $x_{max}$.

11.3 Hierarchical variances: The researchers in Exercise 11.2 are worried that
the plots are not just heterogeneous in their regression lines, but also
in their variances. In this exercise we will consider the same hierarchical
model as above except that the sampling variability within a group
is given by $y_{i,j} \sim$ normal$(\beta_{1,j} + \beta_{2,j} x_{i,j} + \beta_{3,j} x_{i,j}^2, \sigma_j^2)$, that is, the
variances are allowed to differ across groups. As in Section 8.5, we will model
$\sigma_1^2, \ldots, \sigma_m^2 \sim$ i.i.d. inverse gamma$(\nu_0/2, \nu_0\sigma_0^2/2)$, with $\sigma_0^2 \sim$ gamma(2, 2)
and $p(\nu_0)$ uniform on the integers $\{1, 2, \ldots, 100\}$.

a) Obtain the full conditional distribution of $\sigma_0^2$.

b) Obtain the full conditional distribution of $\sigma_j^2$.

c) Obtain the full conditional distribution of $\beta_j$.

d) For two values $\nu_0^{(1)}$ and $\nu_0^{(2)}$ of $\nu_0$, obtain the ratio $p(\nu_0^{(1)} \mid \sigma_0^2, \sigma_1^2, \ldots, \sigma_m^2)$
divided by $p(\nu_0^{(2)} \mid \sigma_0^2, \sigma_1^2, \ldots, \sigma_m^2)$, and simplify as much as possible.

e) Implement a Metropolis-Hastings algorithm for obtaining the joint
posterior distribution of all of the unknown parameters. Plot values
of $\sigma_0^2$ and $\nu_0$ versus iteration number and describe the mixing of the
Markov chain in terms of these parameters.

f) Compare the prior and posterior distributions of $\nu_0$. Comment on any
evidence there is that the variances differ across the groups.

11.4 Hierarchical logistic regression: The Washington Assessment of Student
Learning (WASL) is a standardized test given to students in the state of
Washington. Letting $j$ index the counties within the state of Washington
and $i$ index schools within counties, the file mathstandard.dat includes
data on the following variables:
$y_{i,j}$ = the indicator that more than half the 10th graders in school $i,j$
passed the WASL math exam;
$x_{i,j}$ = the percentage of teachers in school $i,j$ who have a masters
degree.

In this exercise we will construct an algorithm to approximate the posterior
distribution of the parameters in a generalized linear mixed-effects
model for these data. The model is a mixed effects version of logistic
regression:

$$y_{i,j} \sim \text{binomial}\left(e^{\gamma_{i,j}}/[1 + e^{\gamma_{i,j}}]\right), \text{ where } \gamma_{i,j} = \beta_{0,j} + \beta_{1,j} x_{i,j}$$

$$\beta_1, \ldots, \beta_m \sim \text{i.i.d. multivariate normal}(\theta, \Sigma), \text{ where } \beta_j = (\beta_{0,j}, \beta_{1,j})$$

a) The unknown parameters in the model include population-level parameters
$\{\theta, \Sigma\}$ and the group-level parameters $\{\beta_1, \ldots, \beta_m\}$. Draw
a diagram that describes the relationships between these parameters,
the data $\{y_{i,j}, x_{i,j} : i = 1, \ldots, n_j,\ j = 1, \ldots, m\}$, and prior distributions.

b) Before we do a Bayesian analysis, we will get some ad hoc estimates
of these parameters via maximum likelihood: Fit a separate logistic
regression model for each group, possibly using the glm command
in R via beta.j <- glm(y.j ~ X.j,family=binomial)$coef . Explain any
problems you have with obtaining estimates for each county. Plot
$\exp\{\hat\beta_{0,j} + \hat\beta_{1,j} x\}/(1 + \exp\{\hat\beta_{0,j} + \hat\beta_{1,j} x\})$ as a function of $x$ for each
county and describe what you see. Using maximum likelihood estimates
only from those counties with 10 or more schools, obtain ad
hoc estimates $\hat\theta$ and $\hat\Sigma$ of $\theta$ and $\Sigma$. Note that these estimates may not
be representative of patterns from schools with small sample sizes.

c) Formulate a unit information prior distribution for $\theta$ and $\Sigma$ based on
the observed data. Specifically, let $\theta \sim$ multivariate normal$(\hat\theta, \hat\Sigma)$ and
let $\Sigma^{-1} \sim$ Wishart$(4, \hat\Sigma^{-1})$. Use a Metropolis-Hastings algorithm to
approximate the joint posterior distribution of all parameters.

d) Make plots of the samples of $\theta$ and $\Sigma$ (5 parameters) versus MCMC
iteration number. Make sure you run the chain long enough so that
your MCMC samples are likely to be a reasonable approximation to
the posterior distribution.

e) Obtain posterior expectations of $\beta_j$ for each group $j$, plot
$\mathrm{E}[\beta_{0,j} \mid y] + \mathrm{E}[\beta_{1,j} \mid y]\,x$ as a function of $x$ for each county, compare to the plot in
b) and describe why you see any differences between the two sets of
regression lines.

f) From your posterior samples, plot marginal posterior and prior densities
of $\theta$ and the elements of $\Sigma$. Include your ad hoc estimates from
b) in the plots. Discuss the evidence that the slopes or intercepts vary
across groups.

11.5 Disease rates: The number of occurrences of a rare, nongenetic birth defect
in a five-year period for six neighboring counties is $y = (1, 3, 2, 12, 1, 1)$.
The counties have populations of $x = (33, 14, 27, 90, 12, 17)$, given in thousands.
The second county has higher rates of toxic chemicals (PCBs)
present in soil samples, and it is of interest to know if this county has a
high disease rate as well. We will use the following hierarchical model to
analyze these data:

• $Y_i \mid \theta_i, x_i \sim$ Poisson$(\theta_i x_i)$;
• $\theta_1, \ldots, \theta_6 \mid a, b \sim$ gamma$(a, b)$;
• $a \sim$ gamma(1, 1); $b \sim$ gamma(10, 1).

a) Describe in words what the various components of the hierarchical
model represent in terms of observed and expected disease rates.

b) Identify the form of the conditional distribution $p(\theta_1, \ldots, \theta_6 \mid a, b, x, y)$,
and from this identify the full conditional distribution of the rate
for each county, $p(\theta_i \mid \theta_{-i}, a, b, x, y)$.

c) Write out the ratio of the posterior densities comparing a set of proposal
values $(a^*, b^*, \theta)$ to values $(a, b, \theta)$. Note the value of $\theta$, the
vector of county-specific rates, is unchanged.

d) Construct a Metropolis-Hastings algorithm which generates samples
of $(a, b, \theta)$ from the posterior. Do this by iterating the following steps:

1. Given a current value $(a, b, \theta)$, generate a proposal $(a^*, b^*, \theta)$ by
sampling $a^*$ and $b^*$ from a symmetric proposal distribution centered
around $a$ and $b$, but making sure all proposals are positive
(see Exercise 10.1). Accept the proposal with the appropriate
probability.

2. Sample new values of the $\theta_i$'s from their full conditional distributions.

Perform diagnostic tests on your chain and modify if necessary. 

e) Make posterior inference on the infection rates using the samples from
the Markov chain. In particular,

i. Compute marginal posterior distributions of $\theta_1, \ldots, \theta_6$ and compare
them to $y_1/x_1, \ldots, y_6/x_6$.

ii. Examine the posterior distribution of $a/b$, and compare it to the
corresponding prior distribution as well as to the average of $y_i/x_i$
across the six counties.

iii. Plot samples of $\theta_2$ versus $\theta_j$ for each $j \neq 2$, and draw a 45 degree
line on the plot as well. Also estimate $\Pr(\theta_2 > \theta_j \mid x, y)$ for
each $j$ and $\Pr(\theta_2 = \max\{\theta_1, \ldots, \theta_6\} \mid x, y)$. Interpret the results
of these calculations, and compare them to the conclusions one
might obtain if they just examined $y_j/x_j$ for each county $j$.
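
Step 1 of part d) calls for a symmetric proposal that never leaves the positive half-line. One possible way to do this, offered only as a sketch and not as the intended solution, is a reflecting random walk: take a uniform step around the current value and reflect any negative draw about zero. The function name `propose_positive` and the half-width `delta` are illustrative assumptions, not names from the text.

```r
# Reflecting random-walk proposal for a positive parameter (illustrative sketch).
# `propose_positive` and `delta` are hypothetical names, not from the text.
propose_positive <- function(current, delta) {
  # Symmetric uniform step around the current value; abs() reflects negative
  # draws about zero, which keeps the proposal density symmetric.
  abs(current + runif(length(current), -delta, delta))
}

set.seed(1)
a <- 1; b <- 10                            # current values of a and b
a_star <- propose_positive(a, delta = 2)   # proposal for a
b_star <- propose_positive(b, delta = 2)   # proposal for b
many <- propose_positive(rep(0.1, 1000), delta = 2)  # stress test near zero
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of posterior densities from part c); `delta` would normally be tuned by checking acceptance rates and autocorrelation.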


Chapter 12 


12.1 Rank regression: The 1996 General Social Survey gathered a wide variety
of information on the adult U.S. population, including each survey
respondent's sex, their self-reported frequency of religious prayer (on a
six-level ordinal scale), and the number of items correct out of 10 on a
short vocabulary test. These data appear in the file prayer.dat. Using
the rank regression procedure described in Section 12.1.2, estimate the
parameters in a regression model for $Y_i$ = prayer as a function of $x_{i,1}$ =
sex of respondent (0-1 indicator of being female) and $x_{i,2}$ = vocabulary
score, as well as their interaction $x_{i,3} = x_{i,1} \times x_{i,2}$. Compare marginal
prior distributions of the three regression parameters to their posterior
distributions, and comment on the evidence that the relationship between
prayer and score differs across the sexes.
12.2 Copula modeling: The file azdiabetes_alldata.dat contains data on
eight variables for 632 women in a study on diabetes (see Exercise 7.6
for a description of the variables). Data on subjects labeled 201-300 have
missing values for some variables, mostly for the skin fold thickness
measurement.

a) Using only the data from subjects 1-200, implement the Gaussian
copula model for the eight variables in this dataset. Obtain posterior
means and 95% posterior confidence intervals for all $\binom{8}{2} = 28$
parameters.

b) Now use the data from subjects 1-300, thus including data from subjects
who are missing some variables. Implement the Gaussian copula
model and obtain posterior means and 95% posterior confidence intervals
for all parameters. How do the results differ from those in a)?

12.3 Constrained normal: Let $p(z) \propto \text{dnorm}(z, \theta, \sigma) \times \delta_{(a,b)}(z)$, the normal
density constrained to the interval $(a, b)$. Prove that the inverse-cdf method
outlined in Section 12.1.1 generates a sample from this distribution.
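
Before attempting the proof, it can help to see the sampler in action. The sketch below assumes the usual form of the inverse-cdf method for a constrained normal: draw a uniform variable between the cdf values at the two endpoints, then apply the normal quantile function. The helper name `rcnorm` and the numerical values are illustrative assumptions.

```r
# Inverse-cdf sampler for a normal(theta, sigma^2) constrained to (a, b).
# `rcnorm` is a hypothetical helper name; sigma is a standard deviation.
rcnorm <- function(n, theta, sigma, a, b) {
  lo <- pnorm(a, theta, sigma)   # cdf at the lower endpoint
  hi <- pnorm(b, theta, sigma)   # cdf at the upper endpoint
  u <- runif(n, lo, hi)          # uniform on (F(a), F(b))
  qnorm(u, theta, sigma)         # map back through the inverse cdf
}

set.seed(1)
z <- rcnorm(5000, theta = 0, sigma = 1, a = -0.5, b = 2)
z_range <- range(z)              # should fall inside (-0.5, 2)
```

A histogram of `z` overlaid with the renormalized `dnorm` curve on $(a, b)$ gives an informal check of what the exercise asks you to prove.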

12.4 Categorical data and the Dirichlet distribution: Consider again the data
on the number of children of men in their 30s from Exercise 4.8. These
data could be considered as categorical data, as each sample $Y$ lies in the
discrete set $\{1, \ldots, 8\}$ (8 here actually denotes "8 or more" children). Let
$\theta_A = (\theta_{A,1}, \ldots, \theta_{A,8})$ be the proportion in each of the eight categories
from the population of men with bachelor's degrees, and let the vector $\theta_B$
be defined similarly for the population of men without bachelor's degrees.

a) Write in a compact form the conditional probability given $\theta_A$ of observing
a particular sequence $\{y_{A,1}, \ldots, y_{A,n_A}\}$ for a random sample
from the A population.

b) Identify the sufficient statistic. Show that the Dirichlet family of distributions,
with densities of the form $p(\theta \mid a) \propto \theta_1^{a_1 - 1} \times \cdots \times \theta_K^{a_K - 1}$, is
a conjugate class of prior distributions for this sampling model.

c) The function rdir() below samples from the Dirichlet distribution:

rdir<-function(nsamp=1,a) # a is a vector
{
  Z<-matrix( rgamma(length(a)*nsamp,a,1),
             nsamp,length(a),byrow=T)
  Z/apply(Z,1,sum)
}

Using this function, generate 5,000 or more samples of $\theta_A$ and $\theta_B$ from
their posterior distributions. Using a Monte Carlo approximation, obtain
and plot the posterior distributions of $\mathrm{E}[Y_A \mid \theta_A]$ and $\mathrm{E}[Y_B \mid \theta_B]$,
as well as of $\tilde{Y}_A$ and $\tilde{Y}_B$.
d) Compare the results above to those in Exercise 4.8. Perform the goodness
of fit test from that exercise on this model, and compare to the
fit of the Poisson model.
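
As a quick sanity check on rdir() from part c) before using it for posterior sampling, the sketch below redefines the function with explicit assignment arrows and confirms two basic properties: every sampled row sums to one, and the Monte Carlo mean is close to the Dirichlet mean $a/\sum_k a_k$. The parameter vector `a` here is an arbitrary illustration, not the posterior from the exercise.

```r
# rdir() as in part c): normalize independent gamma(a_k, 1) draws.
rdir <- function(nsamp = 1, a) {   # a is a vector of positive parameters
  Z <- matrix(rgamma(length(a) * nsamp, a, 1),
              nsamp, length(a), byrow = TRUE)
  Z / apply(Z, 1, sum)             # divide each row by its sum
}

set.seed(1)
a <- c(2, 1, 1, 4)                 # illustrative parameters, not from the exercise
theta <- rdir(5000, a)             # 5000 x 4 matrix of Dirichlet draws
row_sums <- rowSums(theta)         # each row should sum to 1
mc_mean <- colMeans(theta)         # Monte Carlo estimate of E[theta]
exact_mean <- a / sum(a)           # Dirichlet mean
```

For the exercise itself, `a` would be replaced by the posterior parameter vector found in part b), once for each population.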


Common distributions 


The binomial distribution 


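A random variable $X \in \{0, 1, \ldots, n\}$ has a binomial$(n, \theta)$ distribution if
$\theta \in [0, 1]$ and

$$\Pr(X = x \mid \theta, n) = \binom{n}{x} \theta^x (1 - \theta)^{n - x} \quad \text{for } x \in \{0, 1, \ldots, n\}.$$

For this distribution,

$\mathrm{E}[X \mid \theta] = n\theta$,
$\mathrm{Var}[X \mid \theta] = n\theta(1 - \theta)$,
$\mathrm{mode}[X \mid \theta] = \lfloor (n + 1)\theta \rfloor$,
$p(x \mid \theta, n)$ = dbinom(x,n,theta) .

If $X_1 \sim$ binomial$(n_1, \theta)$ and $X_2 \sim$ binomial$(n_2, \theta)$ are independent, then
$X = X_1 + X_2 \sim$ binomial$(n_1 + n_2, \theta)$. When $n = 1$ this distribution is called the
binary or Bernoulli distribution. The binomial$(n, \theta)$ model assumes that $X$ is
(equal in distribution to) a sum of independent binary$(\theta)$ random variables.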
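
The summaries above can be checked numerically from the probability mass function. The following sketch uses arbitrary illustrative values of `n` and `theta`, and verifies the additivity property by convolving two binomial pmfs.

```r
# Check the binomial(n, theta) summaries against dbinom().
n <- 10; theta <- 0.3                # illustrative values
x <- 0:n
p <- dbinom(x, n, theta)

mean_x <- sum(x * p)                 # should equal n * theta
var_x  <- sum((x - mean_x)^2 * p)    # should equal n * theta * (1 - theta)
mode_x <- x[which.max(p)]            # should equal floor((n + 1) * theta)

# Sum of independent binomials with a common theta, via convolution of pmfs
n1 <- 4; n2 <- 6
p_sum <- sapply(0:(n1 + n2), function(s) {
  k <- 0:s
  sum(dbinom(k, n1, theta) * dbinom(s - k, n2, theta))
})
conv_gap <- max(abs(p_sum - dbinom(0:(n1 + n2), n1 + n2, theta)))
```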


The beta distribution 


A random variable $X \in [0, 1]$ has a beta$(a, b)$ distribution if $a > 0$, $b > 0$ and

$$p(x \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} x^{a - 1} (1 - x)^{b - 1} \quad \text{for } 0 \le x \le 1.$$

For this distribution,

$\mathrm{E}[X \mid a, b] = \frac{a}{a + b}$,
$\mathrm{Var}[X \mid a, b] = \frac{ab}{(a + b + 1)(a + b)^2} = \mathrm{E}[X \mid a, b] \times \mathrm{E}[1 - X \mid a, b] \times \frac{1}{a + b + 1}$,
$\mathrm{mode}[X \mid a, b] = \frac{a - 1}{(a - 1) + (b - 1)}$ if $a > 1$ and $b > 1$,
$p(x \mid a, b)$ = dbeta(x,a,b) .

The beta distribution is closely related to the gamma distribution. See the
paragraph on the gamma distribution below for details. A multivariate version
of the beta distribution is the Dirichlet distribution, described in Exercise 12.4.


The Poisson distribution 
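A random variable $X \in \{0, 1, 2, \ldots\}$ has a Poisson$(\theta)$ distribution if $\theta > 0$ and

$$\Pr(X = x \mid \theta) = \theta^x e^{-\theta} / x! \quad \text{for } x \in \{0, 1, 2, \ldots\}.$$

For this distribution,

$\mathrm{E}[X \mid \theta] = \theta$,
$\mathrm{Var}[X \mid \theta] = \theta$,
$\mathrm{mode}[X \mid \theta] = \lfloor \theta \rfloor$,
$p(x \mid \theta)$ = dpois(x,theta) .

If $X_1 \sim$ Poisson$(\theta_1)$ and $X_2 \sim$ Poisson$(\theta_2)$ are independent, then $X_1 + X_2 \sim$
Poisson$(\theta_1 + \theta_2)$. The Poisson family has a "mean-variance relationship," which
describes the fact that $\mathrm{E}[X \mid \theta] = \mathrm{Var}[X \mid \theta] = \theta$. If it is observed that a sample
mean is very different from the sample variance, then the Poisson model may
not be appropriate. If the sample variance is larger than the sample mean, then a
negative binomial model (Section 3.2.1) might be a better fit.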
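
The mean-variance check described above is easy to carry out on a sample. In the sketch below, simulated Poisson counts are compared with simulated negative binomial counts that have the same mean; the sample sizes and parameter values are arbitrary illustrations.

```r
set.seed(1)
y_pois <- rpois(2000, lambda = 3)            # Poisson counts with mean 3
y_nb   <- rnbinom(2000, size = 2, mu = 3)    # overdispersed counts, also mean 3

# Ratio of sample variance to sample mean
ratio_pois <- var(y_pois) / mean(y_pois)     # near 1 under a Poisson model
ratio_nb   <- var(y_nb) / mean(y_nb)         # above 1: variance is mu + mu^2/size
```

A ratio well above one, as for `y_nb`, is the kind of evidence that would point away from the Poisson model and toward a negative binomial model.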


The gamma and inverse-gamma distributions 


A random variable $X \in (0, \infty)$ has a gamma$(a, b)$ distribution if $a > 0$, $b > 0$
and

$$p(x \mid a, b) = \frac{b^a}{\Gamma(a)} x^{a - 1} e^{-bx} \quad \text{for } x > 0.$$

For this distribution,

$\mathrm{E}[X \mid a, b] = a/b$,
$\mathrm{Var}[X \mid a, b] = a/b^2$,
$\mathrm{mode}[X \mid a, b] = (a - 1)/b$ if $a \ge 1$, $0$ if $0 < a < 1$,
$p(x \mid a, b)$ = dgamma(x,a,b) .

If $X_1 \sim$ gamma$(a_1, b)$ and $X_2 \sim$ gamma$(a_2, b)$ are independent, then $X_1 + X_2 \sim$
gamma$(a_1 + a_2, b)$ and $X_1/(X_1 + X_2) \sim$ beta$(a_1, a_2)$. If $X \sim$ normal$(0, \sigma^2)$
then $X^2 \sim$ gamma$(1/2, 1/[2\sigma^2])$. The chi-square distribution with $\nu$ degrees
of freedom is the same as a gamma$(\nu/2, 1/2)$ distribution.
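
Two of the relationships above can be checked directly in R. The sketch compares the chi-square and gamma densities on a grid, and then simulates the ratio $X_1/(X_1 + X_2)$ to compare it with a beta distribution. All parameter values are arbitrary illustrations; note that the third positional argument of dgamma and rgamma is the rate $b$.

```r
x <- seq(0.1, 20, by = 0.1)
nu <- 5
# Chi-square with nu degrees of freedom versus gamma(nu/2, 1/2) (rate form)
chisq_gap <- max(abs(dchisq(x, nu) - dgamma(x, nu / 2, 1 / 2)))

# X1/(X1+X2) ~ beta(a1, a2) for independent gammas with a common rate b
set.seed(1)
a1 <- 2; a2 <- 3; b <- 4
x1 <- rgamma(10000, a1, b)
x2 <- rgamma(10000, a2, b)
w <- x1 / (x1 + x2)
ks <- ks.test(w, "pbeta", a1, a2)   # Kolmogorov-Smirnov distance to beta(a1, a2)
```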


A random variable $X \in (0, \infty)$ has an inverse-gamma$(a, b)$ distribution if
$1/X$ has a gamma$(a, b)$ distribution. In other words, if $Y \sim$ gamma$(a, b)$ and
$X = 1/Y$, then $X \sim$ inverse-gamma$(a, b)$. The density of $X$ is

$$p(x \mid a, b) = \frac{b^a}{\Gamma(a)} x^{-a - 1} e^{-b/x} \quad \text{for } x > 0.$$

For this distribution,

$\mathrm{E}[X \mid a, b] = b/(a - 1)$ if $a > 1$, $\infty$ if $0 < a \le 1$,
$\mathrm{Var}[X \mid a, b] = b^2/[(a - 1)^2 (a - 2)]$ if $a > 2$, $\infty$ if $0 < a \le 2$,
$\mathrm{mode}[X \mid a, b] = b/(a + 1)$.

Note that the inverse-gamma density is not simply the gamma density with
$x$ replaced by $1/x$: There is an additional factor of $x^{-2}$ due to the Jacobian
in the change-of-variables formula (see Exercise 10.3).
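
The role of the Jacobian can be seen numerically. Base R has no inverse-gamma density, so the sketch below defines a hypothetical helper `dinvgamma` as the gamma density evaluated at $1/x$ times $x^{-2}$, then compares it with the closed form above and checks the normalization, mean and mode. The values of `a` and `b` are arbitrary.

```r
# Hypothetical helper: inverse-gamma(a, b) density via change of variables.
dinvgamma <- function(x, a, b) {
  dgamma(1 / x, a, b) / x^2          # gamma density at 1/x times the Jacobian x^(-2)
}

a <- 3; b <- 2                        # illustrative values
x <- c(0.25, 0.5, 1, 2)
closed_form <- b^a / gamma(a) * x^(-a - 1) * exp(-b / x)
gap <- max(abs(dinvgamma(x, a, b) - closed_form))

total <- integrate(dinvgamma, 0, Inf, a = a, b = b)$value          # should be 1
m1 <- integrate(function(x) x * dinvgamma(x, a, b), 0, Inf)$value  # b / (a - 1)
mode_num <- optimize(dinvgamma, c(0.01, 5), a = a, b = b,
                     maximum = TRUE)$maximum                       # b / (a + 1)
```

Dropping the division by `x^2` leaves a function that no longer integrates to one, which is exactly the point made in the note above.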


The univariate normal distribution 


A random variable $X \in \mathbb{R}$ has a normal$(\theta, \sigma^2)$ distribution if $\sigma^2 > 0$ and

$$p(x \mid \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}(x - \theta)^2/\sigma^2} \quad \text{for } -\infty < x < \infty.$$

For this distribution,

$\mathrm{E}[X \mid \theta, \sigma^2] = \theta$,
$\mathrm{Var}[X \mid \theta, \sigma^2] = \sigma^2$,
$\mathrm{mode}[X \mid \theta, \sigma^2] = \theta$,
$p(x \mid \theta, \sigma^2)$ = dnorm(x,theta,sigma) .

Remember that R parameterizes things in terms of the standard deviation
$\sigma$, and not the variance $\sigma^2$. If $X_1 \sim$ normal$(\theta_1, \sigma_1^2)$ and $X_2 \sim$ normal$(\theta_2, \sigma_2^2)$
are independent, then $aX_1 + bX_2 + c \sim$ normal$(a\theta_1 + b\theta_2 + c,\ a^2\sigma_1^2 + b^2\sigma_2^2)$.

A normal sampling model is often useful even if the underlying population
does not have a normal distribution. This is because statistical procedures
that assume a normal model will generally provide good estimates of the
population mean and variance, regardless of whether or not the population is
normal (see Section 5.5 for a discussion).
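
The reminder about R's parameterization is worth a concrete check, since passing a variance where a standard deviation is expected produces no error. The sketch below uses arbitrary illustrative values; it also simulates the linear-combination result stated above.

```r
theta <- 1; sigma2 <- 4             # mean and variance (illustrative values)
x <- 2.5
by_hand <- exp(-0.5 * (x - theta)^2 / sigma2) / sqrt(2 * pi * sigma2)
right   <- dnorm(x, theta, sqrt(sigma2))   # pass the standard deviation
wrong   <- dnorm(x, theta, sigma2)         # passing the variance gives a different density

# Linear combination: a*X1 + b*X2 + c0 has variance a^2*sigma1^2 + b^2*sigma2^2
set.seed(1)
a <- 2; b <- -1; c0 <- 3
x1 <- rnorm(20000, 1, sqrt(4))      # normal(1, 4)
x2 <- rnorm(20000, -2, sqrt(9))     # normal(-2, 9)
lin <- a * x1 + b * x2 + c0         # mean 2*1 + (-1)*(-2) + 3 = 7, variance 16 + 9 = 25
```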


The multivariate normal distribution 


A random vector $X \in \mathbb{R}^p$ has a multivariate normal$(\theta, \Sigma)$ distribution if $\Sigma$
is a positive definite $p \times p$ matrix and

$$p(x \mid \theta, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}(x - \theta)^T \Sigma^{-1} (x - \theta)\right\} \quad \text{for } x \in \mathbb{R}^p.$$

For this distribution,

$\mathrm{E}[X \mid \theta, \Sigma] = \theta$,
$\mathrm{Var}[X \mid \theta, \Sigma] = \Sigma$,
$\mathrm{mode}[X \mid \theta, \Sigma] = \theta$.

Just like the univariate normal distribution, if $X_1 \sim$ normal$(\theta_1, \Sigma_1)$ and
$X_2 \sim$ normal$(\theta_2, \Sigma_2)$ are independent, then $aX_1 + bX_2 + c \sim$ normal$(a\theta_1 +
b\theta_2 + c,\ a^2\Sigma_1 + b^2\Sigma_2)$. Marginal and conditional distributions of subvectors
of $X$ also have multivariate normal distributions: Let $a \subset \{1, \ldots, p\}$
be a subset of variable indices, and let $b = a^c$ be the remaining indices.
Then $X_{[a]} \sim$ multivariate normal$(\theta_{[a]}, \Sigma_{[a,a]})$ and $\{X_{[b]} \mid X_{[a]}\} \sim$ multivariate
normal$(\theta_{b|a}, \Sigma_{b|a})$, where

$$\theta_{b|a} = \theta_{[b]} + \Sigma_{[b,a]} (\Sigma_{[a,a]})^{-1} (X_{[a]} - \theta_{[a]})$$
$$\Sigma_{b|a} = \Sigma_{[b,b]} - \Sigma_{[b,a]} (\Sigma_{[a,a]})^{-1} \Sigma_{[a,b]}.$$

Simulating a multivariate normal random variable can be achieved by a linear
transformation of a vector of i.i.d. standard normal random variables. If $Z$
is the vector with elements $Z_1, \ldots, Z_p \sim$ i.i.d. normal(0, 1) and $AA^T = \Sigma$,
then $X = \theta + AZ \sim$ multivariate normal$(\theta, \Sigma)$. Usually $A$ is the Choleski
factorization of $\Sigma$. The following R-code will generate an $n \times p$ matrix such
that the rows are i.i.d. samples from a multivariate normal distribution:

Z<-matrix(rnorm(n*p),nrow=n,ncol=p)
X<-t( t(Z%*%chol(Sigma)) + c(theta) )
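
The conditional mean and covariance formulas given earlier in this section translate directly into matrix code. The sketch below uses an arbitrary three-dimensional example and conditions on the first coordinate; the numerical values and the variable names are illustrative assumptions.

```r
# Conditional mean and covariance of X[b] given X[a] (illustrative values).
theta <- c(1, 2, 3)
Sigma <- matrix(c(4, 1, 2,
                  1, 3, 1,
                  2, 1, 5), 3, 3)
a <- 1; b <- c(2, 3)                 # condition on the first coordinate
x_a <- 3                             # observed value of X[a]

S_aa_inv <- solve(Sigma[a, a, drop = FALSE])
theta_b_a <- theta[b] +
  Sigma[b, a, drop = FALSE] %*% S_aa_inv %*% (x_a - theta[a])
Sigma_b_a <- Sigma[b, b] -
  Sigma[b, a, drop = FALSE] %*% S_aa_inv %*% Sigma[a, b, drop = FALSE]
```

Using `drop = FALSE` keeps the sub-blocks as matrices even when `a` or `b` is a single index, so the same lines work for any partition of the indices.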
The Wishart and inverse-Wishart distributions


A random $p \times p$ symmetric positive definite matrix $X$ has a Wishart$(\nu, M)$
distribution if the integer $\nu \ge p$, $M$ is a $p \times p$ symmetric positive definite
matrix and

$$p(X \mid \nu, M) = \left[2^{\nu p/2}\, \Gamma_p(\nu/2)\, |M|^{\nu/2}\right]^{-1} \times |X|^{(\nu - p - 1)/2}\, \mathrm{etr}(-M^{-1}X/2),$$

where

$\Gamma_p(\nu/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma[(\nu + 1 - j)/2]$, and
$\mathrm{etr}(A) = \exp(\sum_j a_{j,j})$, the exponent of the sum of the diagonal elements.

For this distribution,

$\mathrm{E}[X \mid \nu, M] = \nu M$,
$\mathrm{Var}[X_{i,j} \mid \nu, M] = \nu \times (m_{i,j}^2 + m_{i,i} m_{j,j})$,
$\mathrm{mode}[X \mid \nu, M] = (\nu - p - 1) M$.

The Wishart distribution is a multivariate version of the gamma distribution.
Just as the sum of squares of i.i.d. univariate normal variables has a
gamma distribution, the sum of squares of i.i.d. multivariate normal vectors
has a Wishart distribution. Specifically, if $Y_1, \ldots, Y_\nu \sim$ i.i.d. multivariate
normal$(0, M)$, then $\sum_{i=1}^{\nu} Y_i Y_i^T \sim$ Wishart$(\nu, M)$. This relationship can be used
to generate a Wishart-distributed random matrix:

Z<-matrix(rnorm(nu*p),nrow=nu,ncol=p)  # standard normal
Y<-Z%*%chol(M)                         # rows have cov=M
X<-t(Y)%*%Y                            # Wishart matrix
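
One way to gain confidence in this construction is to average many simulated matrices and compare the result with $\nu M$. The sketch below wraps the three lines above in a hypothetical helper `rwish` and cross-checks against `rWishart` from R's stats package; the values of `p`, `nu` and `M` are arbitrary illustrations.

```r
set.seed(1)
p <- 2; nu <- 5
M <- matrix(c(2, 0.5,
              0.5, 1), p, p)

rwish <- function(nu, M) {                 # hypothetical helper name
  Z <- matrix(rnorm(nu * nrow(M)), nrow = nu, ncol = nrow(M))
  Y <- Z %*% chol(M)                       # rows have covariance M
  t(Y) %*% Y                               # sum of squares matrix
}

draws <- replicate(5000, rwish(nu, M))     # p x p x 5000 array
mc_mean <- apply(draws, c(1, 2), mean)     # compare with nu * M
ref_mean <- apply(rWishart(5000, nu, M), c(1, 2), mean)  # built-in sampler as a cross-check
```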


A random $p \times p$ symmetric positive definite matrix $X$ has an inverse-Wishart$(\nu, M)$
distribution if $X^{-1}$ has a Wishart$(\nu, M)$ distribution. In other
words, if $Y \sim$ Wishart$(\nu, M)$ and $X = Y^{-1}$, then $X \sim$ inverse-Wishart$(\nu, M)$.
The density of $X$ is

$$p(X \mid \nu, M) = \left[2^{\nu p/2}\, \Gamma_p(\nu/2)\, |M|^{\nu/2}\right]^{-1} \times |X|^{-(\nu + p + 1)/2}\, \mathrm{etr}(-M^{-1}X^{-1}/2).$$

For this distribution,

$\mathrm{E}[X \mid \nu, M] = (\nu - p - 1)^{-1} M^{-1}$,
$\mathrm{mode}[X \mid \nu, M] = (\nu + p + 1)^{-1} M^{-1}$.

The second moments (i.e. the variances) of the elements of $X$ are given in
Press (1972). Since we often use the inverse-Wishart distribution as a prior
distribution for a covariance matrix $\Sigma$, it is sometimes useful to parameterize
the distribution in terms of $S = M^{-1}$. Then if $\Sigma \sim$ inverse-Wishart$(\nu, S^{-1})$,
we have $\mathrm{mode}[\Sigma \mid \nu, S] = (\nu + p + 1)^{-1} S$. If $\Sigma_0$ were the most probable value
of $\Sigma$ a priori, then we would set $S = (\nu_0 + p + 1)\Sigma_0$, so that $\Sigma \sim$ inverse-Wishart$(\nu_0, [(\nu_0 + p + 1)\Sigma_0]^{-1})$
and $\mathrm{mode}[\Sigma \mid \nu_0, S] = \Sigma_0$.

For more on the Wishart distribution and its relationship to the multivariate
normal distribution, see Press (1972) or Mardia et al. (1979).


